I am trying to scrape ASINs from Amazon. Please note that this is not about scraping product detail pages (like this: https://www.youtube.com/watch?v=qRVRIh3GZgI), but about searching for a keyword (in this example "trimmer"; try this:
https://www.amazon.com/s?k=trimmer&ref=nb_sb_noss_2). The search returns many products, and I am able to scrape all the titles.
What is not visible is the ASIN (a unique Amazon identifier). While inspecting the HTML, I saw a link (href) that contains the ASIN. In the example below, the ASIN is B01MSHQ5IQ:
<a class="a-link-normal a-text-normal" href="/Philips-Norelco-Groomer-MG3750-50/dp/B01MSHQ5IQ/ref=sr_1_3?keywords=trimmer&qid=1554462204&s=gateway&sr=8-3">
Which brings me to my question: how can I retrieve all the product titles AND ASINs on the page? For example:
Number Title ASIN
1 Braun, Beardtrimmer B07JH1LLYR
2 TNT Pro Series Waist B00R84J2PK
... ... ...
So far I am using Scrapy (but I am also open to other Python solutions), and I am able to scrape the titles.
My code so far:
First run in the command line:
scrapy startproject tutorial
Then adjust the spider file (see Example 1) and items.py (see Example 2).
Example 1
import scrapy
from tutorial.items import AmazonItem

class AmazonProductSpider(scrapy.Spider):
    name = "AmazonDeals"
    allowed_domains = ["amazon.com"]

    # Use the working search URL below
    start_urls = [
        "https://www.amazon.com/s?k=trimmer&ref=nb_sb_noss_2"
    ]

    # Run with: scrapy crawl AmazonDeals -o Asin_Titles.json
    def parse(self, response):
        items = AmazonItem()
        Title = response.css('.a-text-normal').css('::text').extract()
        items['title_Products'] = Title
        yield items
As requested by @glhr, adding the items.py code:
Example 2
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class AmazonItem(scrapy.Item):
    # define the fields for your item here like:
    title_Products = scrapy.Field()
You can get the link to the product by extracting the href attribute of <a class="a-link-normal a-text-normal" href="...">:
Link = response.css('.a-text-normal').css('a::attr(href)').extract()
From a link, you can use a regular expression to extract the ASIN:
(?<=dp/)[A-Z0-9]{10}
The regular expression above matches 10 characters (uppercase letters or digits) preceded by dp/. See the demo here: https://regex101.com/r/mLMv3k/1
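As a quick standalone check, here is a minimal sketch using the example href from the question:

import re

href = "/Philips-Norelco-Groomer-MG3750-50/dp/B01MSHQ5IQ/ref=sr_1_3?keywords=trimmer&qid=1554462204&s=gateway&sr=8-3"
match = re.search(r"(?<=dp/)[A-Z0-9]{10}", href)
if match:
    print(match.group(0))  # prints: B01MSHQ5IQ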
Here's a working implementation of the parse() method (note that it requires import re at the top of the spider file):
def parse(self, response):
    Link = response.css('.a-text-normal').css('a::attr(href)').extract()
    Title = response.css('span.a-text-normal').css('::text').extract()

    # for each product, create an AmazonItem, populate the fields and yield the item
    for result in zip(Link, Title):
        item = AmazonItem()
        item['title_Product'] = result[1]
        item['link_Product'] = result[0]
        # extract the ASIN from the link
        ASIN = re.findall(r"(?<=dp/)[A-Z0-9]{10}", result[0])[0]
        item['ASIN_Product'] = ASIN
        yield item
This requires extending AmazonItem with new fields:
class AmazonItem(scrapy.Item):
    # define the fields for your item here like:
    title_Product = scrapy.Field()
    link_Product = scrapy.Field()
    ASIN_Product = scrapy.Field()
Sample output:
{'ASIN_Product': 'B01MSHQ5IQ',
'link_Product': '/Philips-Norelco-Groomer-MG3750-50/dp/B01MSHQ5IQ',
'title_Product': 'Philips Norelco Multigroom Series 3000, 13 attachments, '
'FFP, MG3750'}
{'ASIN_Product': 'B01MSHQ5IQ',
'link_Product': '/Philips-Norelco-Groomer-MG3750-50/dp/B01MSHQ5IQ',
'title_Product': 'Philips Norelco Multi Groomer MG7750/49-23 piece, beard, '
'body, face, nose, and ear hair trimmer, shaver, and clipper'}
Demo: https://repl.it/#glhr/55534679-AmazonSpider
To write the output to a JSON file, simply specify feed export settings in the spider:
class AmazonProductSpider(scrapy.Spider):
    name = "AmazonDeals"
    allowed_domains = ["amazon.com"]
    start_urls = ["https://www.amazon.com/s?k=trimmer&ref=nb_sb_noss_2"]

    custom_settings = {
        'FEED_URI': 'Asin_Titles.json',
        'FEED_FORMAT': 'json'
    }
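With these settings in place, the crawl command no longer needs the -o flag:

scrapy crawl AmazonDeals

and the scraped items are written to Asin_Titles.json automatically.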
Related
I am scraping BBC Food for recipes. The logic is as follows:
Main page with about 20 cuisines
-> in each cuisine, there are usually ~20 recipes on 1-3 pages for each letter.
-> in each recipe, there are about 6 things I scrape (ingredients, rating, etc.)
Therefore, my logic is: get to the main page, create a request, extract all cuisine links, follow each of them, from there extract each page of recipes, follow each recipe link, and finally get all the data from each recipe. Note this is not finished yet, as I still need to make the spider go through all the letters.
I would love to have a 'category' column, i.e. for each recipe reached via the "african cuisine" link a column entry that says "african", for each recipe from the "italian cuisine" link an "italian" entry, etc.
Desired outcome:
cook_time prep_time name cuisine
10 30 A italian
20 10 B italian
30 20 C indian
20 10 D indian
30 20 E indian
Here is my spider:
import scrapy
from recipes_cuisines.items import RecipeItem

class ItalianSpider(scrapy.Spider):
    name = "italian_spider"

    def start_requests(self):
        start_urls = ['https://www.bbc.co.uk/food/cuisines']
        for url in start_urls:
            yield scrapy.Request(url=url, callback=self.parse_cuisines)

    def parse_cuisines(self, response):
        cuisine_cards = response.xpath('//a[contains(@class,"promo__cuisine")]/@href').extract()
        for url in cuisine_cards:
            yield response.follow(url=url, callback=self.parse_main)

    def parse_main(self, response):
        recipe_cards = response.xpath('//a[contains(@class,"main_course")]/@href').extract()
        for url in recipe_cards:
            yield response.follow(url=url, callback=self.parse_card)
        next_page = response.xpath('//div[@class="pagination gel-wrap"]/ul[@class="pagination__list"]/li[@class="pagination__list-item pagination__priority--0"]/a[@class="pagination__link gel-pica-bold"]/@href').get()
        if next_page is not None:
            next_page_url = response.urljoin(next_page)
            print(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse_main)

    def parse_card(self, response):
        item = RecipeItem()
        item['name'] = response.xpath('//h1[contains(@class,"title__text")]/text()').extract()
        item['prep_time'] = response.xpath('//div[contains(@class,"recipe-metadata-wrap")]/p[@class="recipe-metadata__prep-time"]/text()').extract_first()
        item['cook_time'] = response.xpath('//p[contains(@class,"cook-time")]/text()').extract_first()
        item['servings'] = response.xpath('//p[contains(@class,"serving")]/text()').extract_first()
        item['ratings_amount'] = response.xpath('//div[contains(@class,"aggregate-rating")]/span[contains(@class,"aggregate-rating__total")]/text()').extract()
        #item['ratings_amount'] = response.xpath('//*[@id="main-content"]/div[1]/div[4]/div/div[1]/div/div[1]/div[2]/div[1]/span[2]/text()').extract()
        item['ingredients'] = response.css('li.recipe-ingredients__list-item > a::text').extract()
        return item
and items:
import scrapy

class RecipeItem(scrapy.Item):
    name = scrapy.Field()
    prep_time = scrapy.Field()
    cook_time = scrapy.Field()
    servings = scrapy.Field()
    ratings_amount = scrapy.Field()
    rating = scrapy.Field()
    ingredients = scrapy.Field()
    cuisine = scrapy.Field()
Note I am saving the output via
scrapy crawl italian_spider -o test.csv
I have read the documentation and tried several things, such as adding the extracted cuisine to the parse_cuisine or parse_main methods, but they all yield an error.
There are two ways to do this. The most common way to pass information from one page to another is to use cb_kwargs in your scrapy.Request:
def parse_cuisine(self, response):
    cuisine = response.xpath('//h1/text()').get()
    for recipe_url in response.xpath('//div[@id="az-recipes--recipes"]//a[.//h3]/@href').getall():
        yield scrapy.Request(
            url=response.urljoin(recipe_url),
            callback=self.parse_recipe,
            cb_kwargs={'cuisine': cuisine},
        )

def parse_recipe(self, response, cuisine):
    print(cuisine)
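From there, parse_recipe can attach the cuisine passed via cb_kwargs to the item along with the other fields; a minimal sketch reusing RecipeItem and the title XPath from the question:

def parse_recipe(self, response, cuisine):
    item = RecipeItem()
    # title XPath taken from the question's parse_card
    item['name'] = response.xpath('//h1[contains(@class,"title__text")]/text()').get()
    # cuisine arrives via cb_kwargs from parse_cuisine
    item['cuisine'] = cuisine
    yield item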
But on this website you can also find the cuisine on the recipe page itself (inside a JSON-LD script, after parsing the JSON):
import json

def parse_recipe(self, response):
    recipe_raw = response.xpath('//script[@type="application/ld+json"][contains(., \'"@type":"Recipe"\')]/text()').get()
    recipe = json.loads(recipe_raw)
    cuisine = recipe['recipeCuisine']
Update: the XPath '//script[@type="application/ld+json"][contains(., \'"@type":"Recipe"\')]/text()' finds a script node that has a type attribute with the value application/ld+json and whose text also contains the string "@type":"Recipe".
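If you go this route, the callback can populate the item directly from the parsed JSON (a sketch; recipeCuisine is the key used above, while name is an assumed schema.org Recipe key):

import json

def parse_recipe(self, response):
    recipe_raw = response.xpath('//script[@type="application/ld+json"][contains(., \'"@type":"Recipe"\')]/text()').get()
    recipe = json.loads(recipe_raw)

    item = RecipeItem()
    item['name'] = recipe.get('name')              # assumed schema.org Recipe key
    item['cuisine'] = recipe.get('recipeCuisine')
    yield item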
I am scraping a news website with the Scrapy framework, but it seems to only store the last item scraped, repeated for every iteration of the loop.
I want to store the Title, Date, and Link, which I scrape from the first page,
and also store the whole news article. So I want to merge the article, which is stored as a list, into a single string.
Item code
import scrapy

class ScrapedItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    source = scrapy.Field()
    date = scrapy.Field()
    paragraph = scrapy.Field()
Spider code
import scrapy
from ..items import ScrapedItem

class CBNCSpider(scrapy.Spider):
    name = 'kontan'

    start_urls = [
        'https://investasi.kontan.co.id/rubrik/28/Emiten'
    ]

    def parse(self, response):
        box_text = response.xpath("//ul/li/div[@class='ket']")
        items = ScrapedItem()
        for crawl in box_text:
            title = crawl.css("h1 a::text").extract()
            source = "https://investasi.kontan.co.id" + (crawl.css("h1 a::attr(href)").extract()[0])
            date = crawl.css("span.font-gray::text").extract()[0].replace("|", "")

            items['title'] = title
            items['source'] = source
            items['date'] = date

            yield scrapy.Request(url=source,
                                 callback=self.parseparagraph,
                                 meta={'item': items})

    def parseparagraph(self, response):
        items_old = response.meta['item']  # only the last item is stored
        paragraph = response.xpath("//p/text()").extract()
        items_old['paragraph'] = paragraph  # want to merge into a single string
        yield items_old
I expect the Date, Title, and Source to be updated through the loop,
and the article to be merged into a single string so it can be stored in MySQL.
I defined a new empty dictionary inside the loop and put those variables in it. I've also made some minor changes to your XPaths and CSS selectors to make them less error-prone. The script now works as desired:
import scrapy

class CBNCSpider(scrapy.Spider):
    name = 'kontan'

    start_urls = [
        'https://investasi.kontan.co.id/rubrik/28/Emiten'
    ]

    def parse(self, response):
        for crawl in response.xpath("//*[@id='list-news']//*[@class='ket']"):
            d = {}
            d['title'] = crawl.css("h1 > a::text").get()
            d['source'] = response.urljoin(crawl.css("h1 > a::attr(href)").get())
            d['date'] = crawl.css("span.font-gray::text").get().strip("|")
            yield scrapy.Request(
                url=d['source'],
                callback=self.parseparagraph,
                meta={'item': d}
            )

    def parseparagraph(self, response):
        items_old = response.meta['item']
        items_old['paragraph'] = response.xpath("//p/text()").getall()
        yield items_old
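If you want the article body as one string rather than a list (as mentioned in the question), you can join the extracted paragraphs before yielding, e.g.:

def parseparagraph(self, response):
    items_old = response.meta['item']
    # join all <p> texts into a single string, e.g. for storing in a MySQL text column
    items_old['paragraph'] = " ".join(
        p.strip() for p in response.xpath("//p/text()").getall()
    )
    yield items_old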
I am using Scrapy to scrape some website data, but I can't manage to get my data into the right shape.
This is the output of my code (see code below):
In the command Line:
scrapy crawl myspider -o items.csv
Output:
asin_product product_name
ProductA,,,ProductB,,,ProductC,,, BrandA,,,BrandB,,,BrandC,,,
ProductA,,,ProductD,,,ProductE,,, BrandA,,,BrandB,,,BrandA,,,
# Note that the rows represent the start_urls and that the three commas ',,,' separate the data.
Desired output:
scrapy crawl myspider -o items.csv
Start_URL asin_product product_name
URL1 ProductA BrandA
URL1 ProductB BrandB
URL1 ProductC BrandC
URL2 ProductA BrandA
URL2 ProductD BrandB
URL2 ProductE BrandA
My Scrapy code:
import scrapy
from amazon.items import AmazonItem

class AmazonProductSpider(scrapy.Spider):
    name = "AmazonDeals"
    allowed_domains = ["amazon.com"]

    # Use the working search URLs below
    start_urls = [
        "https://www.amazon.com/s?k=shoes&ref=nb_sb_noss_2",     # This should be URL 1
        "https://www.amazon.com/s?k=computer&ref=nb_sb_noss_2"   # This should be URL 2
    ]

    def parse(self, response):
        items = AmazonItem()
        title = response.xpath('//*[@class="a-size-base-plus a-color-base a-text-normal"]/text()').extract()
        asin = response.xpath('//*[@class="a-link-normal"]/@href').extract()

        # Note that I divided the products with ',,,' to make them easy to separate.
        # I am aware that this is not the best approach.
        items['product_name'] = ',,,'.join(title).strip()
        items['asin_product'] = ',,,'.join(asin).strip()
        yield items
First of all, it's recommended to use CSS when querying by class.
Now to your code:
The product name is within the a tag (the product URL), so you can iterate through the links and store both the URL and the title.
<a class="a-link-normal a-text-normal" href="/adidas-Mens-Lite-Racer-Running/dp/B071P19D3X/ref=sr_1_3?keywords=shoes&qid=1554132536&s=gateway&sr=8-3">
    <span class="a-size-base-plus a-color-base a-text-normal">Adidas masculina Lite Racer byd tênis de corrida</span>
</a>
You need to create one AmazonItem object per row of your CSV file.
def parse(self, response):
    # You need to improve this CSS selector because there are links which
    # are not products; this is why I check whether title is None and continue.
    for product in response.css('a.a-link-normal.a-text-normal'):
        # product is a selector
        title = product.css('span.a-size-base-plus.a-color-base.a-text-normal::text').get()
        if not title:
            continue
        # The selector is already the <a> tag, so we only need to extract its href attribute value.
        asin = product.xpath('./@href').get()

        item = AmazonItem()
        item['product_name'] = title.strip()
        item['asin_product'] = asin.strip()
        yield item
Make the start_url available in the parse method
Instead of using start_urls, you can yield your initial requests from a method named start_requests (see https://docs.scrapy.org/en/latest/intro/tutorial.html?highlight=start_requests#our-first-spider).
With each request you can pass the start URL as meta data. This meta data is then available within your parse method (see https://docs.scrapy.org/en/latest/topics/request-response.html?highlight=meta#scrapy.http.Request.meta).
def start_requests(self):
    urls = [...]  # this is equal to your start_urls
    for start_url in urls:
        yield scrapy.Request(url=start_url, meta={"start_url": start_url})

def parse(self, response):
    start_url = response.meta["start_url"]
Yield multiple items, one for each product
Instead of joining titles and brands, you can yield several items from parse. For the example below I assume the lists title and asin have the same length.
for title, asin in zip(title, asin):
    item = AmazonItem()
    item['product_name'] = title
    item['asin_product'] = asin
    yield item
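Putting both pieces together, the parse method could look roughly like this (a sketch; it assumes you also add a start_url = scrapy.Field() to AmazonItem so the URL ends up as a column in the CSV):

def parse(self, response):
    start_url = response.meta["start_url"]
    titles = response.xpath('//*[@class="a-size-base-plus a-color-base a-text-normal"]/text()').extract()
    asins = response.xpath('//*[@class="a-link-normal"]/@href').extract()

    for title, asin in zip(titles, asins):
        item = AmazonItem()
        item['start_url'] = start_url        # assumed extra field on AmazonItem
        item['product_name'] = title.strip()
        item['asin_product'] = asin.strip()
        yield item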
PS: you should check Amazon's robots.txt. They might not allow you to scrape their site and might ban your IP (https://www.amazon.de/robots.txt).
I have a Scrapy spider that takes a desired keyword as input and then builds a search-result URL. It then crawls that URL to scrape the desired values about each car result into an item. I am trying to add to my yielded items the URL of the full-sized car image that accompanies each car in the list of results.
The specific URL that is crawled when I enter the keyword "honda" is the following:
Honda search results example
I have been having trouble figuring out the correct way to write the XPath and then include whatever list of image URLs I acquire into the spider's item that I yield at the end of my code.
Right now, when the items are saved to a .csv file by running the lkq.py spider below with the command "scrapy crawl lkq -o items.csv -t csv", the Picture column of items.csv is all zeros instead of the image URLs.
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

import scrapy
from scrapy.shell import inspect_response
from scrapy.utils.response import open_in_browser

keyword = raw_input('Keyword: ')
url = 'http://www.lkqpickyourpart.com/DesktopModules/pyp_vehicleInventory/getVehicleInventory.aspx?store=224&page=0&filter=%s&sp=&cl=&carbuyYardCode=1224&pageSize=1000&language=en-US' % (keyword,)

class Cars(scrapy.Item):
    Make = scrapy.Field()
    Model = scrapy.Field()
    Year = scrapy.Field()
    Entered_Yard = scrapy.Field()
    Section = scrapy.Field()
    Color = scrapy.Field()
    Picture = scrapy.Field()

class LkqSpider(scrapy.Spider):
    name = "lkq"
    allowed_domains = ["lkqpickyourpart.com"]
    start_urls = (
        url,
    )

    def parse(self, response):
        picture = response.xpath(
            '//href=/text()').extract()
        section_color = response.xpath(
            '//div[@class="pypvi_notes"]/p/text()').extract()
        info = response.xpath('//td["pypvi_make"]/text()').extract()
        for element in range(0, len(info), 4):
            item = Cars()
            item["Make"] = info[element]
            item["Model"] = info[element + 1]
            item["Year"] = info[element + 2]
            item["Entered_Yard"] = info[element + 3]
            item["Section"] = section_color.pop(
                0).replace("Section:", "").strip()
            item["Color"] = section_color.pop(0).replace("Color:", "").strip()
            item["Picture"] = picture.pop(0).strip()
            yield item
I don't really understand why you were using an XPath like '//href=/text()'; I would recommend reading an XPath tutorial first, here is a very good one.
If you want to get all the image URLs, I think this is what you want:
pictures = response.xpath('//img/#src').extract()
Note that picture.pop(0).strip() only takes one of the URLs at a time and strips it; remember that .extract() returns a list, so pictures now contains all the image links, and you can pick whichever ones you need from it.
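For example, assuming the page exposes exactly one <img> per car result (an assumption you should verify against the actual HTML), you could pair the image list with the existing per-car loop by index:

def parse(self, response):
    pictures = response.xpath('//img/@src').extract()
    info = response.xpath('//td["pypvi_make"]/text()').extract()
    for element in range(0, len(info), 4):
        item = Cars()
        item["Make"] = info[element]
        # ... populate the other fields as in the original code ...
        # element // 4 is the index of the current car, assuming one image per car
        item["Picture"] = response.urljoin(pictures[element // 4])
        yield item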
I am attempting to scrape the Library of Congress/Thomas website. This Python script is intended to access a sample of 40 bills from their site (identifiers 1-40 in the URLs). I want to parse the body of each piece of legislation, search the body/content, extract links to potential multiple versions, and follow them.
Once on the version page(s), I want to parse the body of each version, search the body/content, extract links to potential sections, and follow them.
Once on the section page(s), I want to parse the body of each section of a bill.
I believe there is some issue with the Rules/LinkExtractor segment of my code. The Python code executes and crawls the start URLs, but does not parse them or perform any of the subsequent tasks.
Three issues:
Some bills do not have multiple versions (and therefore no links in the body portion of the page).
Some bills do not have linked sections because they are so short, while some are nothing but links to sections.
Some section links do not contain just section-specific content, and most of the content is just redundant inclusion of prior or subsequent section content.
My question is, again: why is Scrapy not crawling or parsing?
from scrapy.item import Item, Field
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

class BillItem(Item):
    title = Field()
    body = Field()

class VersionItem(Item):
    title = Field()
    body = Field()

class SectionItem(Item):
    body = Field()

class Lrn2CrawlSpider(CrawlSpider):
    name = "lrn2crawl"
    allowed_domains = ["thomas.loc.gov"]
    start_urls = ["http://thomas.loc.gov/cgi-bin/query/z?c107:H.R.%s:" % bill for bill in xrange(000001, 00040, 00001)  ### Sample of 40 bills; total range of bills is 1-5767
    ]

rules = (
    # Extract links matching the /query/ fragment (restricted to those inside the content body of the page)
    # and follow links from them (since no callback means follow=True by default).
    # Desired result: scrape all bill text and, in the event that there are multiple versions, follow them & parse.
    Rule(SgmlLinkExtractor(allow=(r'/query/'), restrict_xpaths=('//div[@id="content"]')), callback='parse_bills', follow=True),

    # Extract links in the body of a bill version & follow them.
    # Desired result: scrape all version text and, in the event that there are multiple sections, follow them & parse.
    Rule(SgmlLinkExtractor(restrict_xpaths=('//div/a[2]')), callback='parse_versions', follow=True)
)

def parse_bills(self, response):
    hxs = HtmlXPathSelector(response)
    bills = hxs.select('//div[@id="content"]')
    scraped_bills = []
    for bill in bills:
        scraped_bill = BillItem()  ### Bill object defined previously
        scraped_bill['title'] = bill.select('p/text()').extract()
        scraped_bill['body'] = response.body
        scraped_bills.append(scraped_bill)
    return scraped_bills

def parse_versions(self, response):
    hxs = HtmlXPathSelector(response)
    versions = hxs.select('//div[@id="content"]')
    scraped_versions = []
    for version in versions:
        scraped_version = VersionItem()  ### Version object defined previously
        scraped_version['title'] = version.select('center/b/text()').extract()
        scraped_version['body'] = response.body
        scraped_versions.append(scraped_version)
    return scraped_versions

def parse_sections(self, response):
    hxs = HtmlXPathSelector(response)
    sections = hxs.select('//div[@id="content"]')
    scraped_sections = []
    for section in sections:
        scraped_section = SectionItem()  ## Segment object defined previously
        scraped_section['body'] = response.body
        scraped_sections.append(scraped_section)
    return scraped_sections

spider = Lrn2CrawlSpider()
Just for the record, the problem with your script is that the variable rules is not inside the scope of Lrn2CrawlSpider because it doesn't share the same indentation, so when alecxe fixed the indentation the variable rules became an attribute of the class. Later, the inherited method __init__() reads that attribute and compiles and enforces the rules:
def __init__(self, *a, **kw):
    super(CrawlSpider, self).__init__(*a, **kw)
    self._compile_rules()
Erasing the last line had nothing to do with that.
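To illustrate the scoping point with a schematic sketch (using current Scrapy import paths rather than the deprecated scrapy.contrib modules used in the question):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class WorkingSpider(CrawlSpider):
    name = "working"
    # class attribute: CrawlSpider.__init__() calls self._compile_rules() and uses it
    rules = (
        Rule(LinkExtractor(allow=r'/query/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        pass

class BrokenSpider(CrawlSpider):
    name = "broken"

# module-level variable: BrokenSpider never sees it and falls back to the
# default empty CrawlSpider.rules, so no links are followed
rules = (
    Rule(LinkExtractor(allow=r'/query/'), callback='parse_item', follow=True),
)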
I've just fixed the indentation, removed the spider = Lrn2CrawlSpider() line at the end of the script, ran the spider via scrapy runspider lrn2crawl.py, and it scrapes, follows links and returns items - your rules work.
Here's what I'm running:
from scrapy.item import Item, Field
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

class BillItem(Item):
    title = Field()
    body = Field()

class VersionItem(Item):
    title = Field()
    body = Field()

class SectionItem(Item):
    body = Field()

class Lrn2CrawlSpider(CrawlSpider):
    name = "lrn2crawl"
    allowed_domains = ["thomas.loc.gov"]
    start_urls = ["http://thomas.loc.gov/cgi-bin/query/z?c107:H.R.%s:" % bill for bill in xrange(000001, 00040, 00001)  ### Sample of 40 bills; total range of bills is 1-5767
    ]

    rules = (
        # Extract links matching the /query/ fragment (restricted to those inside the content body of the page)
        # and follow links from them (since no callback means follow=True by default).
        # Desired result: scrape all bill text and, in the event that there are multiple versions, follow them & parse.
        Rule(SgmlLinkExtractor(allow=(r'/query/'), restrict_xpaths=('//div[@id="content"]')), callback='parse_bills', follow=True),

        # Extract links in the body of a bill version & follow them.
        # Desired result: scrape all version text and, in the event that there are multiple sections, follow them & parse.
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div/a[2]')), callback='parse_versions', follow=True)
    )

    def parse_bills(self, response):
        hxs = HtmlXPathSelector(response)
        bills = hxs.select('//div[@id="content"]')
        scraped_bills = []
        for bill in bills:
            scraped_bill = BillItem()  ### Bill object defined previously
            scraped_bill['title'] = bill.select('p/text()').extract()
            scraped_bill['body'] = response.body
            scraped_bills.append(scraped_bill)
        return scraped_bills

    def parse_versions(self, response):
        hxs = HtmlXPathSelector(response)
        versions = hxs.select('//div[@id="content"]')
        scraped_versions = []
        for version in versions:
            scraped_version = VersionItem()  ### Version object defined previously
            scraped_version['title'] = version.select('center/b/text()').extract()
            scraped_version['body'] = response.body
            scraped_versions.append(scraped_version)
        return scraped_versions

    def parse_sections(self, response):
        hxs = HtmlXPathSelector(response)
        sections = hxs.select('//div[@id="content"]')
        scraped_sections = []
        for section in sections:
            scraped_section = SectionItem()  ## Segment object defined previously
            scraped_section['body'] = response.body
            scraped_sections.append(scraped_section)
        return scraped_sections
Hope that helps.