multiple requests to make an item in scrapy - python

I am new to scrapy and I've come across a complicated case.
I have to make 3 get requests in order to make Product items.
product_url
category_url
stock_url
First, I need a request to product_url and a request to category_url to fill out the fields of Product items. I then need to refer to stock_url's response to determine whether to save or discard the created items.
Here's what I'm doing right now:
In my spider,
def start_requests(self):
    product_url = 'https://www.someurl.com/product?'
    item = ProductItem()
    yield scrapy.Request(product_url, self.parse_products, meta={'item': item})
def parse_products(self, response):
    # fill out 1/2 of the fields of ProductItem
    item = response.meta['item']
    item['name'] = response.xpath(...)
    item['id'] = response.xpath(...)
    category_url = 'https://www.someurl.com/category?'
    yield scrapy.Request(category_url, self.parse_products2, meta={'item': item})
def parse_products2(self, response):
    # fill out the rest of the fields of ProductItem
    item = response.meta['item']
    item['img_url'] = response.xpath(...)
    item['link_url'] = response.xpath(...)
    stock_url = 'https://www.someurl.com/stock?'
    yield scrapy.Request(stock_url, self.parse_final, meta={'item': item})
def parse_final(self, response):
    item = response.meta['item']
    for prod in response:  # pseudocode: iterate over the stock entries
        if prod.id == item['id'] and not prod.in_stock:
            # drop item
            pass
Question: I was told before to handle the item-dropping logic in the pipeline. But whether I drop an item or not depends on making another GET request. Should I still move this logic to the pipelines, and is this possible without inheriting from scrapy.Spider?

Moving the item dropping logic to the pipeline is probably the best design.
You can use the (undocumented) scrapy engine api to download requests in a pipeline. Example assuming the stock info for all items can be accessed from a single url:
import scrapy
from scrapy.exceptions import DropItem
from twisted.internet.defer import inlineCallbacks
class StockPipeline(object):

    @inlineCallbacks
    def open_spider(self, spider):
        req = scrapy.Request(stock_url)
        response = yield spider.crawler.engine.download(req, spider)
        # extract the stock info from the response
        self.stock_info = response.text

    def process_item(self, item, spider):
        # check if the item should be dropped
        if item['id'] not in self.stock_info:
            raise DropItem
        return item
If there is a separate, per-item url for stock info, you'd simply do the downloading in process_item() instead.


Interpreting callbacks and cb_kwargs with scrapy

I'm within reach of a personal milestone with scrapy. The aim is to properly understand callback and cb_kwargs. I've read the documentation countless times, but I learn best with visual code, practice, and an explanation.
I have an example scraper whose aim is to grab the book name and price, go into each book page, and extract a single piece of information. I'm also trying to understand how to properly get information from the next few pages, which I know depends on understanding the operation of callbacks.
When I run my script, it returns results only for the first page. How do I get the additional pages?
Here's my scraper:
class BooksItem(scrapy.Item):
    items = Field(output_processor=TakeFirst())
    price = Field(output_processor=TakeFirst())
    availability = Field(output_processor=TakeFirst())

class BookSpider(scrapy.Spider):
    name = "books"
    start_urls = ['https://books.toscrape.com']

    def start_request(self):
        for url in self.start_url:
            yield scrapy.Request(
                url,
                callback=self.parse)

    def parse(self, response):
        data = response.xpath('//div[@class = "col-sm-8 col-md-9"]')
        for books in data:
            loader = ItemLoader(BooksItem(), selector=books)
            loader.add_xpath('items', './/article[@class="product_pod"]/h3/a//text()')
            loader.add_xpath('price', './/p[@class="price_color"]//text()')
            for url in [books.xpath('.//a//@href').get()]:
                yield scrapy.Request(
                    response.urljoin(url),
                    callback=self.parse_book,
                    cb_kwargs={'loader': loader})
        for next_page in [response.xpath('.//div/ul[@class="pager"]/li[@class="next"]/a//@href').get()]:
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse)

    def parse_book(self, response, loader):
        book_quote = response.xpath('//p[@class="instock availability"]//text()').get()
        loader.add_value('availability', book_quote)
        yield loader.load_item()
I believe the issue is with the part where I try to grab the next few pages. I have tried an alternative approach using the following:
def start_request(self):
    for url in self.start_url:
        yield scrapy.Request(
            url,
            callback=self.parse,
            cb_kwargs={'page_count': 0}
        )

def parse(self, response, page_count):
    if page_count > 3:
        return
    ...
    ...
    page_count += 1
    for next_page in [response.xpath('.//div/ul[@class="pager"]/li[@class="next"]/a//@href').get()]:
        yield response.follow(next_page, callback=self.parse, cb_kwargs={'page_count': page_count})
However, I get the following error with this approach:
TypeError: parse() missing 1 required positional argument: 'page_count'
It should be start_requests, and self.start_urls (inside the function).
get() returns only the first result; what you want is getall(), which returns a list.
There is no need for a for loop in the "next_page" part; it's not a mistake, just unnecessary.
In the line for url in books.xpath you're getting every url twice; again, not a mistake, but still unnecessary.
Here data = response.xpath('//div[@class = "col-sm-8 col-md-9"]') you don't select the books one by one, you select the whole books container; you can check that len(data.getall()) == 1.
book_quote = response.xpath('//p[@class="instock availability"]//text()').get() will return \n; look at the source and try to find out why (hint: the 'i' tag).
Compare your code to this and see what I changed:
import scrapy
from scrapy import Field
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst

class BooksItem(scrapy.Item):
    items = Field(output_processor=TakeFirst())
    price = Field(output_processor=TakeFirst())
    availability = Field(output_processor=TakeFirst())

class BookSpider(scrapy.Spider):
    name = "books"
    start_urls = ['https://books.toscrape.com']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                callback=self.parse)

    def parse(self, response):
        data = response.xpath('//div[@class = "col-sm-8 col-md-9"]//li')
        for books in data:
            loader = ItemLoader(BooksItem(), selector=books)
            loader.add_xpath('items', './/article[@class="product_pod"]/h3/a//text()')
            loader.add_xpath('price', './/p[@class="price_color"]//text()')
            for url in books.xpath('.//h3/a//@href').getall():
                yield scrapy.Request(
                    response.urljoin(url),
                    callback=self.parse_book,
                    cb_kwargs={'loader': loader})
        next_page = response.xpath('.//div/ul[@class="pager"]/li[@class="next"]/a//@href').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_book(self, response, loader):
        # option 1:
        book_quote = response.xpath('//p[@class="instock availability"]/i/following-sibling::text()').get().strip()
        # option 2:
        # book_quote = ''.join(response.xpath('//div[contains(@class, "product_main")]//p[@class="instock availability"]//text()').getall()).strip()
        loader.add_value('availability', book_quote)
        yield loader.load_item()

Scrapy callback asynchronous

def parse(self, response):
    category_names = []
    category_urls = []
    for item in response.css("#zg_browseRoot ul li"):
        category_url = item.css("a").css(self.CSS_URL).extract()
        category_name = item.css("a").css(self.CSS_TEXT).extract()
        category_url = [
            self.parse_url(category_url, 4) for category_url in category_url
        ]
        (category_url,) = category_url
        (category_name,) = category_name
        category_names.append(category_name)
        category_urls.append(category_url)
    for c_name, url in zip(category_names, category_urls):
        self.c_name = [c_name]
        yield scrapy.Request(url, callback=self.parse_categories)

def parse_url(self, url, number):
    parse = urlparse(url)
    split = parse.path.split("/")[:number]
    return f'{self.BASE_URL}{"/".join(split)}'

def parse_categories(self, response):
    sub_names = []
    sub_urls = []
    for item in response.css("#zg_browseRoot ul ul li"):
        sub_name = item.css("a").css(self.CSS_TEXT).extract()
        sub_url = item.css("a").css(self.CSS_URL).extract()
        sub_url = [self.parse_url(sub_url, 5) for sub_url in sub_url]
        (sub_url,) = sub_url
        (sub_name,) = sub_name
        sub_names.append(sub_name)
        sub_urls.append(sub_url)
    for sub_name, url in zip(sub_names, sub_urls):
        self.sub_name = [sub_name]
        # print("{}: {}, {}".format(url, self.sub_name, self.c_name))
        yield scrapy.Request(url, callback=self.parse_subcategories)

def parse_subcategories(self, response):
    url = self.parse_url(response.request.url, 5)
    print(f"{self.c_name}, {self.sub_name}, {url}")
Hello everyone,
I'm having an issue with my Scrapy approach. I'm trying to scrape a page which has categories and subcategories containing items, and I want to include the category and subcategory with each item scraped.
The problem is that Scrapy's callback functions are asynchronous, so zipping the URLs with the names doesn't work: the for loop is processed first, the URLs are stored in a generator, and the names stay behind. Can anyone help me work around this?
Thanks in advance,
Daniel.
You can pass arbitrary data along with the requests by using the cb_kwargs parameter. You can read about the details here.
Here is a simplified example:
def parse(self, response):
    rows = response.xpath('//div[@id="some-element"]')
    for row in rows:
        request_url = row.xpath('a/@href').get()
        category = row.xpath('a/text()').get()
        yield Request(
            url=request_url,
            callback=self.parse_category,
            cb_kwargs={'category': category}
        )

def parse_category(self, response, category):  # Notice the category arg in the func
    # Process here
    yield item
The data inserted in cb_kwargs is passed as a keyword argument into the callback function, so the key in the dict must match the name of the argument in the method definition.
cb_kwargs was introduced in Scrapy v1.7; if you are using an older version you should use the meta param instead. You can read about it here; note that the usage is slightly different.

Multiple Request to Single Field in Scrapy

I am trying to scrape a website using Scrapy. Example Link is: Here.
I am able to get some data using css selectors. I also need to fetch all the image urls of each item. An item can have multiple colours, and when we click on another colour, the browser actually fetches the images from another url. So I need to generate manual requests (due to the multiple colours) and attach "meta" to store the image urls from the other urls into a SINGLE ITEM FIELD.
Here is my Scrapy code:
def get_image_urls(self, response):
    item = response.meta['item']
    if 'image_urls' in item:
        urls = item['image_urls']
    else:
        urls = []
    urls.extend(response.css('.product-image-link::attr(href)').extract())
    item['image_urls'] = urls
    next_url = response.css('.va-color .emptyswatch a::attr(href)').extract()
    #print(item['image_urls'])
    yield Request(next_url[0], callback=self.get_image_urls, meta={'item': item})

def parse(self, response):
    output = JulesProduct()
    output['name'] = self.get_name(response)
    # Now get the recursive img urls
    response.meta['item'] = output
    self.get_image_urls(response)
    return output
Ideally, I should return the output object with all of the required data. My question is: why am I not getting output['image_urls']? When I uncomment the print statement in the get_image_urls function, I see 3 crawled urls and 3 print statements with a url appended after each one. I need them in the parse function. I'm not sure I've described my issue clearly; can anybody help?
Your parse method is returning the output before the get_image_urls requests are done.
You should only yield or return your final item at the end of your recursive logic. Something like this should work:
def parse(self, response):
    output = JulesProduct()
    output['name'] = self.get_name(response)
    yield Request(response.url, callback=self.get_image_urls, meta={'item': output}, dont_filter=True)

def get_image_urls(self, response):
    item = response.meta['item']
    if 'image_urls' in item:
        urls = item['image_urls']
    else:
        urls = []
    urls.extend(response.css('.product-image-link::attr(href)').extract())
    item['image_urls'] = urls
    next_url = response.css('.va-color .emptyswatch a::attr(href)').extract()
    if len(next_url) > 0:
        yield Request(next_url[0], callback=self.get_image_urls, meta={'item': item})
    else:
        yield item

Scrapy can't reach callback for Request

I use the following code in my spider:
def parse_item(self, response):
    item = MyItem()
    item['price'] = [i for i in self.get_usd_price(response)]
    return item

def get_usd_price(self, response):
    yield FormRequest(
        'url',
        formdata={'key': 'value'},
        callback=self.get_currency
    )

def get_currency(self, response):
    self.log('lalalalala')
The problem is I can't reach my get_currency callback. In my logger I see that the price field takes the value [<POST url>]. What am I doing wrong? I tried adding dont_filter to the FormRequest, and changing the FormRequest to a simple GET Request.
Update
I've also tried GHajba's suggestion (so far without success):
def parse_item(self, response):
    item = MyItem()
    self.get_usd_price(response, item)
    return item

def get_usd_price(self, response, item):
    request = FormRequest(
        'url',
        formdata={'key': 'value'},
        callback=self.get_currency
    )
    request.meta['item'] = item
    yield request

def get_currency(self, response):
    self.log('lalalalala')
    item = response.meta['item']
    item['price'] = 123
    return item
This is not how scrapy works. You can only yield a request or an item in each method, and you can't consume the response this way. If you want to update the item with the price information and then yield it, you should do something like:
def parse_item(self, response):
item = MyItem()
# populate the item with this response data
yield FormRequest(
'url',
formdata={'key': 'value'},
callback=self.get_currency, meta={'item':item}
)
def get_currency(self, response):
self.log('lalalalala')
item = response.meta['item']
item['price'] = 123 # get your price from the response body.
# keep populating the item with this response data
yield item
The key point: to pass information between requests, you need to use the meta parameter.
Your problem is that you assign the values of the generator created by get_usd_price to your item. You can solve this by changing the method and how you call it.
You have to yield the FormRequest, but you mustn't use its return value. Just call get_usd_price without assigning the result to item['price']:
self.get_usd_price(response, item)
You have to pass item to your function because Scrapy works asynchronously, so you cannot be sure when the FormRequest executes. Pass the item along as a meta parameter of the FormRequest; you can then access it in the get_currency function and yield the item there.
You can read more about meta in the docs: http://doc.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request.meta

Scrapy parse list of urls, open one by one and parse additional data

I am trying to parse a site, an e-store. I parse a page of products, which are loaded with ajax, get the urls of these products, and then parse additional info for each product by following the parsed urls.
My script gets the list of the first 4 items on the page and their urls, makes the requests, and parses the additional info, but then it does not return to the loop, and so the spider closes.
Could somebody help me in solving this? I'm pretty new to this kind of stuff, and ask here when totally stuck.
Here is my code:
from scrapy import Spider
from scrapy.selector import Selector
from scrapy.http.request import Request
from scrapy_sokos.items import SokosItem

class SokosSpider(Spider):
    name = "sokos"
    allowed_domains = ["sokos.fi"]
    base_url = "http://www.sokos.fi/fi/SearchDisplay?searchTermScope=&searchType=&filterTerm=&orderBy=8&maxPrice=&showResultsPage=true&beginIndex=%s&langId=-11&sType=SimpleSearch&metaData=&pageSize=4&manufacturer=&resultCatEntryType=&catalogId=10051&pageView=image&searchTerm=&minPrice=&urlLangId=-11&categoryId=295401&storeId=10151"
    start_urls = [
        "http://www.sokos.fi/fi/SearchDisplay?searchTermScope=&searchType=&filterTerm=&orderBy=8&maxPrice=&showResultsPage=true&beginIndex=0&langId=-11&sType=SimpleSearch&metaData=&pageSize=4&manufacturer=&resultCatEntryType=&catalogId=10051&pageView=image&searchTerm=&minPrice=&urlLangId=-11&categoryId=295401&storeId=10151",
    ]
    for i in range(0, 8, 4):
        start_urls.append((base_url) % str(i))

    def parse(self, response):
        products = Selector(response).xpath('//div[@class="product-listing product-grid"]/article[@class="product product-thumbnail"]')
        for product in products:
            item = SokosItem()
            item['url'] = product.xpath('//div[@class="content"]/a[@class="image"]/@href').extract()[0]
            yield Request(url=item['url'], meta={'item': item}, callback=self.parse_additional_info)

    def parse_additional_info(self, response):
        item = response.meta['item']
        item['name'] = Selector(response).xpath('//h1[@class="productTitle"]/text()').extract()[0].strip()
        item['description'] = Selector(response).xpath('//div[@id="kuvaus"]/p/text()').extract()[0]
        euro = Selector(response).xpath('//strong[@class="special-price"]/span[@class="euros"]/text()').extract()[0]
        cent = Selector(response).xpath('//strong[@class="special-price"]/span[@class="cents"]/text()').extract()[0]
        item['price'] = '.'.join(euro + cent)
        item['number'] = Selector(response).xpath('//@data-productid').extract()[0]
        yield item
The AJAX requests you are simulating are caught by the Scrapy "duplicate url filter".
Set dont_filter to True when yielding a Request:
yield Request(url=item['url'],
              meta={'item': item},
              callback=self.parse_additional_info,
              dont_filter=True)
