I have a Scrapy spider that takes a desired keyword as input and yields a search-result URL built from it. It then crawls that URL to scrape the desired values for each car result into 'item'. I am trying to add to my yielded items the URL of the full-sized car image that accompanies each car in the vehicle results list.
The specific URL that is crawled when I enter the keyword "honda" is the following:
Honda search results example
I have been having trouble figuring out the correct way to write the XPath and then include whatever list of image URLs I acquire in the 'item' the spider yields in the last part of my code.
Right now, when the items are saved to a .csv file by running the lkq.py spider below with the command "scrapy crawl lkq -o items.csv -t csv", the Picture column of items.csv is all zeros instead of the image URLs.
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import scrapy
from scrapy.shell import inspect_response
from scrapy.utils.response import open_in_browser
keyword = raw_input('Keyword: ')
url = 'http://www.lkqpickyourpart.com/DesktopModules/pyp_vehicleInventory/getVehicleInventory.aspx?store=224&page=0&filter=%s&sp=&cl=&carbuyYardCode=1224&pageSize=1000&language=en-US' % (keyword,)
class Cars(scrapy.Item):
    Make = scrapy.Field()
    Model = scrapy.Field()
    Year = scrapy.Field()
    Entered_Yard = scrapy.Field()
    Section = scrapy.Field()
    Color = scrapy.Field()
    Picture = scrapy.Field()
class LkqSpider(scrapy.Spider):
    name = "lkq"
    allowed_domains = ["lkqpickyourpart.com"]
    start_urls = (
        url,
    )

    def parse(self, response):
        picture = response.xpath(
            '//href=/text()').extract()
        section_color = response.xpath(
            '//div[@class="pypvi_notes"]/p/text()').extract()
        info = response.xpath('//td["pypvi_make"]/text()').extract()
        for element in range(0, len(info), 4):
            item = Cars()
            item["Make"] = info[element]
            item["Model"] = info[element + 1]
            item["Year"] = info[element + 2]
            item["Entered_Yard"] = info[element + 3]
            item["Section"] = section_color.pop(
                0).replace("Section:", "").strip()
            item["Color"] = section_color.pop(0).replace("Color:", "").strip()
            item["Picture"] = picture.pop(0).strip()
            yield item
I don't really understand why you were using an XPath like '//href=/text()'; I would recommend reading an XPath tutorial first, here is a very good one.
If you want to get all the image URLs, I think this is what you want:
pictures = response.xpath('//img/@src').extract()
Note that picture.pop(0).strip() will only get you a single one of the URLs and strip it. Remember that .extract() returns a list, so pictures now contains all the image links; just pick from it the ones you need.
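For example, a minimal sketch of how the image URLs could be paired with the car items (assuming each car row on the results page contributes exactly one <img> in document order; you would need to verify that against the page and tighten the //img selector if it also matches logos or other unrelated images):

    def parse(self, response):
        pictures = response.xpath('//img/@src').extract()
        section_color = response.xpath(
            '//div[@class="pypvi_notes"]/p/text()').extract()
        info = response.xpath('//td["pypvi_make"]/text()').extract()
        for car_index, element in enumerate(range(0, len(info), 4)):
            item = Cars()
            item["Make"] = info[element]
            item["Model"] = info[element + 1]
            item["Year"] = info[element + 2]
            item["Entered_Yard"] = info[element + 3]
            item["Section"] = section_color.pop(0).replace("Section:", "").strip()
            item["Color"] = section_color.pop(0).replace("Color:", "").strip()
            # pair the nth car with the nth image URL instead of popping blindly
            item["Picture"] = pictures[car_index] if car_index < len(pictures) else None
            yield item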
Related
I am scraping a news website with the Scrapy framework, and it seems to only store the last item scraped, repeated throughout the loop.
I want to store the Title, Date, and Link, which I scrape from the first page,
and also store the whole news article. So I want to merge the article, which is stored as a list, into a single string.
Item code
import scrapy
class ScrapedItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    source = scrapy.Field()
    date = scrapy.Field()
    paragraph = scrapy.Field()
Spider code
import scrapy
from ..items import ScrapedItem
class CBNCSpider(scrapy.Spider):
    name = 'kontan'

    start_urls = [
        'https://investasi.kontan.co.id/rubrik/28/Emiten'
    ]

    def parse(self, response):
        box_text = response.xpath("//ul/li/div[@class='ket']")
        items = ScrapedItem()
        for crawl in box_text:
            title = crawl.css("h1 a::text").extract()
            source = "https://investasi.kontan.co.id" + (crawl.css("h1 a::attr(href)").extract()[0])
            date = crawl.css("span.font-gray::text").extract()[0].replace("|", "")

            items['title'] = title
            items['source'] = source
            items['date'] = date

            yield scrapy.Request(url=source,
                                 callback=self.parseparagraph,
                                 meta={'item': items})

    def parseparagraph(self, response):
        items_old = response.meta['item']  # only last item stored
        paragraph = response.xpath("//p/text()").extract()
        items_old['paragraph'] = paragraph  # merge into single string
        yield items_old
I expect the Date, Title, and Source to be updated through the loop,
and the article to be merged into a single string so it can be stored in MySQL.
I defined an empty dictionary and put those variables in it. Moreover, I've made some minor changes to your XPaths and CSS selectors to make them less error-prone. The script is working as desired now:
import scrapy
class CBNCSpider(scrapy.Spider):
    name = 'kontan'

    start_urls = [
        'https://investasi.kontan.co.id/rubrik/28/Emiten'
    ]

    def parse(self, response):
        for crawl in response.xpath("//*[@id='list-news']//*[@class='ket']"):
            d = {}
            d['title'] = crawl.css("h1 > a::text").get()
            d['source'] = response.urljoin(crawl.css("h1 > a::attr(href)").get())
            d['date'] = crawl.css("span.font-gray::text").get().strip("|")
            yield scrapy.Request(
                url=d['source'],
                callback=self.parseparagraph,
                meta={'item': d}
            )

    def parseparagraph(self, response):
        items_old = response.meta['item']
        items_old['paragraph'] = response.xpath("//p/text()").getall()
        yield items_old
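Since .getall() returns a list, one small tweak covers the "merge into a single string" part of the question (a sketch, assuming newline-joined paragraph text is an acceptable format for the MySQL column):

    def parseparagraph(self, response):
        items_old = response.meta['item']
        # join the list of <p> texts into one string before storing
        items_old['paragraph'] = "\n".join(
            p.strip() for p in response.xpath("//p/text()").getall()
        )
        yield items_old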
I am trying to scrape ASIN numbers on Amazon. Please note that this is not about the product detail page (like this: https://www.youtube.com/watch?v=qRVRIh3GZgI), but about searching for a keyword (in this example "trimmer"; try this:
https://www.amazon.com/s?k=trimmer&ref=nb_sb_noss_2). The results are many products, and I am able to scrape all the Titles.
What is not visible is the ASIN (which is a unique Amazon number). While inspecting the HTML, I saw a link (href) that contains the ASIN number. In the example below, the ASIN is B01MSHQ5IQ:
<a class="a-link-normal a-text-normal" href="/Philips-Norelco-Groomer-MG3750-50/dp/B01MSHQ5IQ/ref=sr_1_3?keywords=trimmer&qid=1554462204&s=gateway&sr=8-3">
Ending with my question: How can I retrieve all the Product Titles AND ASIN numbers on the page? For example:
Number Title ASIN
1 Braun, Beardtrimmer B07JH1LLYR
2 TNT Pro Series Waist B00R84J2PK
... ... ...
So far I am using Scrapy (but I am also open to other Python solutions), and I am able to scrape the Titles.
My code so far:
First run in the command line:
scrapy startproject tutorial
Then adjust the spider file (see example 1) and items.py (see example 2).
Example 1
import scrapy
from tutorial.items import AmazonItem  # items.py from the "tutorial" project created above

class AmazonProductSpider(scrapy.Spider):
    name = "AmazonDeals"
    allowed_domains = ["amazon.com"]

    # Use working product URL below
    start_urls = [
        "https://www.amazon.com/s?k=trimmer&ref=nb_sb_noss_2"
    ]

    ## scrapy crawl AmazonDeals -o Asin_Titles.json
    def parse(self, response):
        items = AmazonItem()
        Title = response.css('.a-text-normal').css('::text').extract()
        items['title_Products'] = Title
        yield items
As requested by @glhr, adding the items.py code:
Example 2
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy
class AmazonItem(scrapy.Item):
    # define the fields for your item here like:
    title_Products = scrapy.Field()
You can get the link to the product by extracting the href attribute of <a class="a-link-normal a-text-normal" href="...">:
Link = response.css('.a-text-normal').css('a::attr(href)').extract()
From a link, you can use a regular expression to extract the ASIN number from the link:
(?<=dp/)[A-Z0-9]{10}
The regular expression above will match 10 characters (either uppercase letters or numbers) preceded by dp/. See demo here: https://regex101.com/r/mLMv3k/1
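For example, applying the pattern with Python's re module to the link from the question:

    import re

    link = "/Philips-Norelco-Groomer-MG3750-50/dp/B01MSHQ5IQ/ref=sr_1_3?keywords=trimmer&qid=1554462204&s=gateway&sr=8-3"
    print(re.findall(r"(?<=dp/)[A-Z0-9]{10}", link))  # ['B01MSHQ5IQ']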
Here's a working implementation of the parse() method:
def parse(self, response):
    Link = response.css('.a-text-normal').css('a::attr(href)').extract()
    Title = response.css('span.a-text-normal').css('::text').extract()

    # for each product, create an AmazonItem, populate the fields and yield the item
    for result in zip(Link, Title):
        item = AmazonItem()
        item['title_Product'] = result[1]
        item['link_Product'] = result[0]
        # extract ASIN from link
        ASIN = re.findall(r"(?<=dp/)[A-Z0-9]{10}", result[0])[0]
        item['ASIN_Product'] = ASIN
        yield item
This requires extending AmazonItem with new fields:
class AmazonItem(scrapy.Item):
    # define the fields for your item here like:
    title_Product = scrapy.Field()
    link_Product = scrapy.Field()
    ASIN_Product = scrapy.Field()
Sample output:
{'ASIN_Product': 'B01MSHQ5IQ',
'link_Product': '/Philips-Norelco-Groomer-MG3750-50/dp/B01MSHQ5IQ',
'title_Product': 'Philips Norelco Multigroom Series 3000, 13 attachments, '
'FFP, MG3750'}
{'ASIN_Product': 'B01MSHQ5IQ',
'link_Product': '/Philips-Norelco-Groomer-MG3750-50/dp/B01MSHQ5IQ',
'title_Product': 'Philips Norelco Multi Groomer MG7750/49-23 piece, beard, '
'body, face, nose, and ear hair trimmer, shaver, and clipper'}
Demo: https://repl.it/#glhr/55534679-AmazonSpider
To write the output to a JSON file, simply specify feed export settings in the spider:
class AmazonProductSpider(scrapy.Spider):
    name = "AmazonDeals"
    allowed_domains = ["amazon.com"]
    start_urls = ["https://www.amazon.com/s?k=trimmer&ref=nb_sb_noss_2"]

    custom_settings = {
        'FEED_URI': 'Asin_Titles.json',
        'FEED_FORMAT': 'json'
    }
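With those settings in place, the items are written to Asin_Titles.json by simply running the spider, without the -o flag:

    scrapy crawl AmazonDeals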
I am using Scrapy to scrape some website data, but I can't manage to get my data out in the right shape.
This is the output of my code (see code below):
In the command Line:
scrapy crawl myspider -o items.csv
Output:
asin_product product_name
ProductA,,,ProductB,,,ProductC,,, BrandA,,,BrandB,,,BrandC,,,
ProductA,,,ProductD,,,ProductE,,, BrandA,,,BrandB,,,BrandA,,,
# Note that each row represents one of the start_urls and that the ',,,'
# (three commas) separate the individual values.
Desired output:
scrapy crawl myspider -o items.csv
Start_URL asin_product product_name
URL1 ProductA BrandA
URL1 ProductB BrandB
URL1 ProductC BrandC
URL2 ProductA BrandA
URL2 ProductD BrandB
URL2 ProductE BrandA
My current Scrapy code:
import scrapy
from amazon.items import AmazonItem
class AmazonProductSpider(scrapy.Spider):
    name = "AmazonDeals"
    allowed_domains = ["amazon.com"]

    # Use working product URLs below
    start_urls = [
        "https://www.amazon.com/s?k=shoes&ref=nb_sb_noss_2",      # this should be URL 1
        "https://www.amazon.com/s?k=computer&ref=nb_sb_noss_2"    # this should be URL 2
    ]

    def parse(self, response):
        items = AmazonItem()
        title = response.xpath('//*[@class="a-size-base-plus a-color-base a-text-normal"]/text()').extract()
        asin = response.xpath('//*[@class="a-link-normal"]/@href').extract()
        # Note that I divided the products with ',,,' to make it easy to separate
        # them. I am aware that this is not the best approach.
        items['product_name'] = ',,,'.join(title).strip()
        items['asin_product'] = ',,,'.join(asin).strip()
        yield items
First of all, it's recommended to use CSS when querying by class.
Now to your code:
The product name is within the a tag (the product URL), so you can iterate through the links and store the URL and the title.
<a class="a-link-normal a-text-normal" href="/adidas-Mens-Lite-Racer-Running/dp/B071P19D3X/ref=sr_1_3?keywords=shoes&qid=1554132536&s=gateway&sr=8-3">
<span class="a-size-base-plus a-color-base a-text-normal">Adidas masculina Lite Racer byd tênis de corrida</span>
</a>
You need to create one AmazonItem object per line on your csv file.
def parse(self, response):
    # You need to improve this css selector because there are links which
    # are not a product; this is why I am checking if title is None and continuing.
    for product in response.css('a.a-link-normal.a-text-normal'):
        # product is a selector
        title = product.css('span.a-size-base-plus.a-color-base.a-text-normal::text').get()
        if not title:
            continue
        # The selector is already the a tag, so we only need to extract its href attribute value.
        asin = product.xpath('./@href').get()

        item = AmazonItem()
        item['product_name'] = title.strip()
        item['asin_product'] = asin.strip()
        yield item
Make the start_url available in the parse method
Instead of using start_urls, you can yield your initial requests from a method named start_requests (see https://docs.scrapy.org/en/latest/intro/tutorial.html?highlight=start_requests#our-first-spider).
With each request you can pass the start url as meta data. This meta data is then available within your parse method (see https://docs.scrapy.org/en/latest/topics/request-response.html?highlight=meta#scrapy.http.Request.meta).
def start_requests(self):
    urls = [...]  # this is equal to your start_urls
    for start_url in urls:
        yield scrapy.Request(url=start_url, meta={"start_url": start_url})

def parse(self, response):
    start_url = response.meta["start_url"]
yield multiple items, one for each product
Instead of joining titles and brands, you can yield several items from parse. For the example below I assume the lists title and asin have the same length.
for title, asin in zip(title, asin):
    item = AmazonItem()
    item['product_name'] = title
    item['asin_product'] = asin
    yield item
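Putting both parts together, a rough sketch of what parse could look like (using your own XPaths, and assuming you add an extra start_url field to AmazonItem so the Start_URL column shows up in the CSV):

    def parse(self, response):
        titles = response.xpath('//*[@class="a-size-base-plus a-color-base a-text-normal"]/text()').extract()
        asins = response.xpath('//*[@class="a-link-normal"]/@href').extract()
        for title, asin in zip(titles, asins):
            item = AmazonItem()
            item['start_url'] = response.meta["start_url"]  # hypothetical extra field
            item['product_name'] = title.strip()
            item['asin_product'] = asin.strip()
            yield item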
PS: you should check Amazon's robots.txt. They might not allow you to scrape their site and could ban your IP (https://www.amazon.de/robots.txt).
I am trying to scrape the underlying data on the table in the following pages: https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries
What I want to do is access the underlying link for each row, and capture:
The ID tag (e.g. QDE001),
The name
The reason for listing / additional information
Other linked entities
This is what I have, but it does not seem to be working; I keep getting a "NotImplementedError('{}.parse callback is not defined'.format(self.__class__.__name__))". I believe that the XPaths I have defined are OK; I'm not sure what I am missing.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
class UNSCItem(scrapy.Item):
    name = scrapy.Field()
    uid = scrapy.Field()
    link = scrapy.Field()
    reason = scrapy.Field()
    add_info = scrapy.Field()
class UNSC(scrapy.Spider):
    name = "UNSC"

    start_urls = [
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=0',
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=1',
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=2',
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=3',
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=4',
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=5',
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=6',
    ]

    rules = Rule(LinkExtractor(allow=('/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries/',)), callback='data_extract')

    def data_extract(self, response):
        item = UNSCItem()
        name = response.xpath('//*[@id="content"]/article/div[3]/div//text()').extract()
        uid = response.xpath('//*[@id="content"]/article/div[2]/div/div//text()').extract()
        reason = response.xpath('//*[@id="content"]/article/div[6]/div[2]/div//text()').extract()
        add_info = response.xpath('//*[@id="content"]/article/div[7]//text()').extract()
        related = response.xpath('//*[@id="content"]/article/div[8]/div[2]//text()').extract()
        yield item
Try the below approach. It should fetch you all the IDs and corresponding names from all seven pages. I suppose you can manage the rest of the fields yourself.
Just run it as it is:
import scrapy
class UNSC(scrapy.Spider):
    name = "UNSC"

    start_urls = ['https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page={}'.format(page) for page in range(0, 7)]

    def parse(self, response):
        for item in response.xpath('//*[contains(@class,"views-table")]//tbody//tr'):
            idnum = item.xpath('.//*[contains(@class,"views-field-field-reference-number")]/text()').extract()[-1].strip()
            name = item.xpath('.//*[contains(@class,"views-field-title")]//span[@dir="ltr"]/text()').extract()[-1].strip()
            yield {'ID': idnum, 'Name': name}
I'm trying to extract JSON data from a website with Scrapy, but I'm facing some issues: when I run my spider, it gives no error and says it crawled 0 pages. I also use the command-line option to store the output in a JSON file so I can inspect it.
The following code is my spider:
import scrapy
class WineSpider(scrapy.Spider):
    name = "SpidyWine"
    i = 0
    url = 'https://maiscarrinho.com/api/search?q=vinho&pageNumber=%s&pageSize=10'
    start_urls = [url % 1]

    def parse(self, response):
        data = json.loads(response.body)
        for item in data['results']:
            yield {
                'Image': item.get('image')
            }
        if data['Image']:
            i = i + 1
            yield scrapy.Request(self.url % i, callback=self.parse)
And my class of items:
import scrapy
class MaiscarrinhoItem(scrapy.Item):
    image = scrapy.Field()
    price = scrapy.Field()
    supermarket = scrapy.Field()
    promotion = scrapy.Field()
    wineName = scrapy.Field()
    brand = scrapy.Field()
For now, I'm just using the Image field in my spider to keep things simple.
Also, my idea when I wrote the if statement in my spider was to deal with the infinite scrolling: when the JSON API has 'Image', it means that the page has content.
Output in Console
Thanks in advance
You did everything right except for a very small mistake.
The field name which contains the image is Image, not image.
Try :
yield {
    'Image': item.get('Image')
}
There is probably also something wrong with your ITEM_PIPELINES in the settings.py file.
Well, answering my own question after digging into my code for some time... I realized it was down to indentation errors and some syntax errors.
Another point was the pipeline: I forgot to change the last part to the real name of my pipeline, so instead of 'Maiscarrinho.pipelines.SomePipeline': 300 I now have 'Maiscarrinho.pipelines.MaiscarrinhoPipeline': 300.
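In settings.py that entry looks roughly like this (the dotted path has to end with the actual class name defined in pipelines.py):

    # settings.py
    ITEM_PIPELINES = {
        'Maiscarrinho.pipelines.MaiscarrinhoPipeline': 300,
    }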
The code below extracts the images like I want, but there is still one problem. Since the page has infinite scrolling, I have a condition to check whether there is an element named 'Image', but for some reason I'm not getting the desired result. It should extract 40 pages, each with 10 images.
import scrapy
import json
class WineSpider(scrapy.Spider):
    name = "SpidyWine"
    url = 'https://maiscarrinho.com/api/search?q=vinho&pageNumber=%s&pageSize=10'
    start_urls = [url % 1]
    i = 1

    def parse(self, response):
        data = json.loads(response.body.decode('utf-8'))
        for item in data['results']:
            yield {
                'Image': item.get('Image')
            }
        if item.get('Image'):
            WineSpider.i += 1
            yield scrapy.Request(self.url % WineSpider.i, callback=self.parse)