I have an item pipeline to process prices. I am getting errors while processing items in this pipeline, but the Scrapy error doesn't tell me which URL produced the error. Is there any way I can access the request object inside the pipeline?
def process_item(self, item, spider):
    """
    :param self:
    :param item:
    :param spider:
    """
    print dir(spider)  # No request object here...
    quit()
    if not all(item['price']):
        raise DropItem
    item['price']['new'] = float(re.sub(
        "\D", "", item['price']['new']))
    item['price']['old'] = float(re.sub(
        "\D", "", item['price']['old']))
    try:
        item['price']['discount'] = math.ceil(
            100 - (100 * (item['price']['new'] /
                          item['price']['old'])))
    except ZeroDivisionError as e:
        print "Error in calculating discount {item} {request}".format(item=item, request=spider.request)  # here I want to see the culprit url...
        raise DropItem
    return item
You can't from an ItemPipeline. You would be able to access the response (and response.url) from a spider middleware, but I think the easier solution is to add a temporary url field, assigned when you yield the item, something like:
yield {...
'url': response.url,
...}
Then the URL can be easily accessed inside the pipeline.
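A minimal sketch of that idea, assuming the spider adds a 'url' key to every item it yields. DropItem below is a local stand-in for scrapy.exceptions.DropItem so the snippet runs without Scrapy, and PricePipeline is a hypothetical name:

```python
import math
import re


class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem so this sketch runs without Scrapy."""


class PricePipeline(object):
    """Hypothetical pipeline that reports the item's own 'url' field on failure."""

    def process_item(self, item, spider):
        price = item['price']
        # all() over a dict checks its keys, so the values are tested explicitly.
        if not all(price.values()):
            raise DropItem("Missing price in %s" % item.get('url'))
        new = float(re.sub(r"\D", "", price['new']))
        old = float(re.sub(r"\D", "", price['old']))
        try:
            price['discount'] = math.ceil(100 - (100 * (new / old)))
        except ZeroDivisionError:
            # The URL travels with the item, so no request object is needed.
            raise DropItem("Error in calculating discount for %s" % item.get('url'))
        price['new'], price['old'] = new, old
        return item
```

If the url field should not end up in the exported data, it can be deleted from the item in the last pipeline before returning it.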
Any class variables you define in your spider class can be accessed within your pipeline via spider.variable_name:
class MySpider(scrapy.Spider):
    name = "walmart"
    my_var = "TEST"
    my_dict = {'test': "test_val"}
Now in your pipeline you can do spider.name, spider.my_var, spider.my_dict.
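As a rough illustration, here is how a pipeline might read those attributes. The spider class below is a plain stand-in (not a real scrapy.Spider) so the snippet runs on its own, and TagPipeline with its 'source' and 'tag' keys is a hypothetical example:

```python
class MySpider(object):  # stand-in for scrapy.Spider, for illustration only
    name = "walmart"
    my_var = "TEST"
    my_dict = {'test': "test_val"}


class TagPipeline(object):
    """Hypothetical pipeline that copies spider attributes onto each item."""

    def process_item(self, item, spider):
        item['source'] = spider.name
        # getattr with a default avoids an AttributeError for spiders
        # that do not define my_var.
        item['tag'] = getattr(spider, 'my_var', None)
        return item


item = TagPipeline().process_item({}, MySpider())
print(item)
```

Note that this gives access to what the spider defines, not to the request that produced a particular item.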
In Scrapy 2.4.x on Python 3.8.x I am yielding an item with the purpose to save some stats to a DB. The scraper has another Item that gets yielded as well.
While the name of the item ("StatsItem") is present in the main script, it is lost within the other class. I am using the name of the item to decide which method to call:
in scraper.py:
import scrapy
from crawler.items import StatsItem, OtherItem

class demo(scrapy.Spider):
    def parse_item(self, response):
        stats = StatsItem()
        stats['results'] = 10
        yield stats
        print(type(stats).__name__)
        # Output: StatsItem
        print(stats)
        # Output: {'results': 10}
in pipeline.py:
import scrapy
from crawler.items import StatsItem, OtherItem

class mysql_pipeline(object):
    def process_item(self, item, spider):
        print(type(item).__name__)
        # Output: NoneType
        if isinstance(item, StatsItem):
            self.save_stats(item, spider)
        elif isinstance(item, OtherItem):
            # call other method
            pass
        return item
The output of print in the first class is "StatsItem", while it is "NoneType" within the pipeline, so the method save_stats() never gets called.
I am pretty new to Python, so there might be a better way of doing this. There is no error message or exception I am aware of. Any help is greatly appreciated.
You can't use yield outside of a function imo.
I was finally able to locate the problem. This crawler was nearly identical to all the others that did not have the issue, with one exception: I was setting the item pipeline through custom settings:
custom_settings.update({
    'ITEM_PIPELINES': {
        'crawler.pipelines.mysql_pipeline': 301,
    }
})
Removing this fixed the issue.
I'm using Scrapy to crawl a website. The first call seems OK and collects some data. For every subsequent request I need some information from another request. To simplify the program, I separated the different requests into different method calls. But it seems that Scrapy does not support method calls with some special parameters: none of the sub-calls are executed.
I tried already a few different things:
Called an instance method with self.sendQueryHash(response, tagName, afterHash)
Called a static method with sendQueryHash(response, tagName, afterHash) and changed the indent
Removed the method call and it worked. I saw the sendQueryHash output on the logger.
import scrapy
import re
import json
import logging
import time  # needed for time.time() in parsefirstAccess

class TestpostSpider(scrapy.Spider):
    name = 'testPost'
    allowed_domains = ['test.com']
    tags = [
        "this",
        "that",
    ]

    def start_requests(self):
        requests = []
        for i, value in enumerate(self.tags):
            url = "https://www.test.com/{}/".format(value)
            requests.append(scrapy.Request(
                url,
                meta={'cookiejar': i},
                callback=self.parsefirstAccess))
        return requests

    def parsefirstAccess(self, response):
        self.logger.info("parsefirstAccess")
        jsonData = response.text
        # That call works fine
        tagName, hasNext, afterHash = self.extractFirstNextPageData(jsonData)
        yield {
            'json': jsonData,
            'requestTime': int(round(time.time() * 1000)),
            'requestNumber': 0
        }
        if not hasNext:
            self.logger.info("hasNext is false")
            # No more data available, stop processing
            return
        else:
            self.logger.info("hasNext is true")
            # Send request to get the query hash of the current tag
            self.sendQueryHash(response, tagName, afterHash)  # Problem occurs here

    def sendQueryHash(self, response, tagName, afterHash):
        self.logger.info("sendQueryHash")
        request = scrapy.Request(
            "https://www.test.com/static/bundles/es6/TagPageContainer.js/21d3cb18e725.js",
            meta={'cookiejar': response.meta['cookiejar']},
            callback=self.parseQueryHash,
            dont_filter=True)
        request.cb_kwargs['tagName'] = tagName
        request.cb_kwargs['afterHash'] = afterHash
        yield request

    def extractFirstNextPageData(self, json):
        return "data1", True, "data3"
I expect the sendQueryHash output to be shown, but it never appears. It only appears when I comment out the lines self.sendQueryHash and def sendQueryHash.
That's only one example of the behavior that I don't expect.
self.sendQueryHash(response, tagName, afterHash) # Problem occurs here
will just create a generator that you do nothing with. You need to make sure you yield your Request back to the Scrapy engine. Since only a single request is produced, you can use return instead of yield inside sendQueryHash and then yield the Request directly by replacing the above line with:
yield self.sendQueryHash(response, tagName, afterHash)
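The underlying behavior can be reproduced without Scrapy. The sketch below uses hypothetical stand-in functions (plain strings instead of Request objects) to show that calling a generator function runs none of its body until the result is iterated. It also shows an alternative to the return-based fix: keeping yield in the helper and delegating to it with yield from (Python 3.3+).

```python
def send_query_hash(log):
    """Generator function: the body runs only when the result is iterated."""
    log.append("sendQueryHash")
    yield "request"


def parse_without_yield(log):
    yield "item"
    send_query_hash(log)  # creates a generator object and discards it


def parse_with_yield_from(log):
    yield "item"
    # Re-yield everything the helper generator produces.
    yield from send_query_hash(log)


log_a, log_b = [], []
out_a = list(parse_without_yield(log_a))
out_b = list(parse_with_yield_from(log_b))
print(out_a, log_a)  # ['item'] []
print(out_b, log_b)  # ['item', 'request'] ['sendQueryHash']
```

In the first variant the log stays empty, which matches the missing sendQueryHash log line in the question.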
I have the following Python script using Scrapy:
import scrapy

class ChemSpider(scrapy.Spider):
    name = "site"

    def start_requests(self):
        urls = [
            'https://www.site.com.au'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        category_links = response.css('li').xpath('a/@href').getall()
        category_links_filtered = [x for x in category_links if 'shop-online' in x]  # remove non category links
        category_links_filtered = list(dict.fromkeys(category_links_filtered))  # remove duplicates
        for category_link in category_links_filtered:
            if "medicines" in category_link:
                next_page = response.urljoin(category_link) + '?size=10'
                self.log(next_page)
                yield scrapy.Request(next_page, callback=self.parse_subcategories)

    def parse_subcategories(self, response):
        for product in response.css('div.Product'):
            yield {
                'category_link': response.url,
                'product_name': product.css('img::attr(alt)').get(),
                'product_price': product.css('span.Price::text').get().replace('\n', '')
            }
My solution will run multiple instances of this script, each scraping a different subset of information from different 'categories'. I know you can run Scrapy from the command line to output to a JSON file, but I want to write the output to a file from within the function, so that each instance writes to a different file. Being a beginner with Python, I'm not sure where to go with my script. I need to get the output of the yield into a file while the script is executing. How do I achieve this? There will be hundreds of rows scraped, and I'm not familiar enough with how yield works to understand how to 'return' from it a set of data (or a list) that can then be written to the file.
You are looking to append to a file. But since file writing is an I/O operation, you need to lock the file so that other processes cannot write to it while one process is writing.
The easiest way to achieve this is to write to different files with random names in one directory and concatenate them all using another process.
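A minimal sketch of that approach, assuming each process writes JSON lines to its own file in a shared directory. The helper names write_part and concatenate_parts, the .jl extension, and the sample rows are illustrative choices, not part of the original answer:

```python
import glob
import os
import tempfile
import uuid


def write_part(directory, lines):
    """Write one process's output to its own uniquely named .jl file."""
    path = os.path.join(directory, uuid.uuid4().hex + ".jl")
    with open(path, "w") as handle:
        for line in lines:
            handle.write(line + "\n")
    return path


def concatenate_parts(directory, target):
    """Merge every .jl part in `directory` into `target`; return the line count."""
    count = 0
    with open(target, "w") as out:
        for part in sorted(glob.glob(os.path.join(directory, "*.jl"))):
            with open(part) as handle:
                for line in handle:
                    out.write(line)
                    count += 1
    return count


workdir = tempfile.mkdtemp()
write_part(workdir, ['{"product_name": "a"}'])
write_part(workdir, ['{"product_name": "b"}', '{"product_name": "c"}'])
merged = os.path.join(workdir, "merged.json")  # not matched by the *.jl pattern
total = concatenate_parts(workdir, merged)
print(total)
```

Because every writer owns its file, no locking is needed while scraping. The merge step should run only after all writers have finished.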
First, let me suggest some changes to your code. If you want to remove duplicates, you could use a set like this:
category_links_filtered = (x for x in category_links if 'shop-online' in x) # remove non category links
category_links_filtered = set(category_links_filtered) # remove duplicates
Note that I'm also changing the [ to ( to make a generator instead of a list and save some memory. Read more about generators: https://www.python-course.eu/python3_generators.php
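One trade-off is worth keeping in mind: a set does not guarantee iteration order, while the dict.fromkeys approach from the question keeps first-seen order (guaranteed from Python 3.7). A small comparison, using made-up sample links:

```python
links = ['/shop-online/b', '/about', '/shop-online/a', '/shop-online/b']

# Generator expression: nothing is filtered until it is consumed.
filtered = (x for x in links if 'shop-online' in x)

# dict.fromkeys keeps the first occurrence of each link, in order.
ordered_unique = list(dict.fromkeys(filtered))

# A set also removes duplicates, but its iteration order is arbitrary.
unordered_unique = set(x for x in links if 'shop-online' in x)

print(ordered_unique)
print(unordered_unique)
```

If the crawl order of the categories does not matter, the set version is fine.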
OK, then the solution to your problem is to use an Item Pipeline (https://docs.scrapy.org/en/latest/topics/item-pipeline.html), which performs some action on every item yielded from your parse_subcategories function. Add a class to your pipelines.py file and enable the pipeline in settings.py, like this:
In settings.py:
ITEM_PIPELINES = {
    'YOURBOTNAME.pipelines.CategoriesPipeline': 300,  # the number is the priority of the pipeline; don't worry about it, just leave it
}
In pipelines.py:
import json
from urllib.parse import urlparse  # library to parse URLs (on Python 2: from urlparse import urlparse)

class CategoriesPipeline(object):
    # This class saves the data to a file named after the category in the URL or after an attribute
    def open_spider(self, spider):
        self.file = None
        if hasattr(spider, 'filename'):
            # the filename is an attribute set by -a filename=somefilename
            filename = spider.filename
        else:
            # you could also derive the name from the start URL, if you set -a start_url=https://www.site.com.au/category-name
            try:
                filename = urlparse(spider.start_url).path[1:]  # this returns 'category-name'
            except AttributeError:
                # this should not happen; stop the spider instead of opening a file
                spider.crawler.engine.close_spider(spider, reason='no start url')
                return
        self.file = open(filename + '.jl', 'w')

    def close_spider(self, spider):
        if self.file is not None:
            self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
In spiders/YOURBOTNAME.py modify this:
class ChemSpider(scrapy.Spider):
    name = "site"

    def __init__(self, *args, **kwargs):
        super(ChemSpider, self).__init__(*args, **kwargs)
        # arguments passed with -a become attributes of the spider
        if not hasattr(self, 'start_url'):
            raise ValueError('no start url')  # we need a start url
        self.start_urls = [self.start_url]  # see why this works on https://docs.scrapy.org/en/latest/intro/tutorial.html#a-shortcut-for-creating-requests

    def parse(self, response):  # ...
Then start your crawl with this command: scrapy crawl site -a start_url=https://www.site.com.au/category-name. You can optionally add -a filename=somename.
I'm trying to download a file using a custom Scrapy pipeline. However, the file URL is not trivial to obtain. Here are the steps:
the pipeline gets an item containing a pdfLink attribute
the page at pdfLink is a wrapper around the PDF, which is embedded in an iframe
I then extend the FilesPipeline class:
import scrapy
from scrapy.pipelines.files import FilesPipeline

class PdfPipeline(FilesPipeline):
    def get_media_requests(self, item, spider):
        yield scrapy.Request(item['pdfLink'],
                             callback=self.get_pdfurl)

    def get_pdfurl(self, response):
        import logging
        logging.info('...............')
        print response.url
        yield scrapy.Request(response.css('iframe::attr(src)').extract()[0])
However:
the files that are downloaded are the web pages pointed to by pdfLink, not the embedded PDF file.
neither the print nor the logging.info output is shown in the logs.
It seems that get_pdfurl is never called back. Am I doing something wrong? How can I download such a nested file?
I found a solution using two consecutive pipelines, where the first is built as in the "Take screenshot of item" example in the Item Pipeline documentation.
class PdfWrapperPipeline(object):
    def process_item(self, item, spider):
        # Download the wrapper page and resume item processing when it arrives
        request = scrapy.Request(item.get('pdfLink'))
        dfd = spider.crawler.engine.download(request, spider)
        dfd.addBoth(self.return_item, item)
        return dfd

    def return_item(self, response, item):
        if response.status != 200:
            # Error happened, return item.
            return item
        url = response.css('iframe::attr(src)').extract()[0]
        item['pdfUrl'] = url
        return item

class PdfPipeline(FilesPipeline):
    def get_media_requests(self, item, spider):
        yield scrapy.Request(item.get('pdfUrl'))
Then, in settings.py, give the wrapper pipeline a lower number than the PDF pipeline so that it runs first:
ITEM_PIPELINES = {
    'project.pipelines.PdfWrapperPipeline': 1,
    'project.pipelines.PdfPipeline': 2,
}
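To make the ordering concrete: items pass through the enabled pipelines in ascending order of their configured number. The snippet below is not Scrapy's own code, only an illustration of that rule applied to the two entries above:

```python
ITEM_PIPELINES = {
    'project.pipelines.PdfPipeline': 2,
    'project.pipelines.PdfWrapperPipeline': 1,
}

# Sort the pipeline paths by their number, lowest first.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order)
```

With these values the wrapper pipeline fills in pdfUrl before PdfPipeline asks for it.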
This answer was first posted on Scrapy's GitHub.
I'm working with scrapy. In my current project I am capturing the text from pdf files. I want to send this to a pipeline for parsing. Right now I have:
def get_pdf_text(self, response):
    in_memory_pdf = BytesIO(bytes(response.body))
    in_memory_pdf.seek(0)
    doc = slate.PDF(in_memory_pdf)
    item = OveItem()
    item['pdf_text'] = doc
    return item
pipelines.py
class OvePipeline(object):
    def process_item(self, item, spider):
        .......
        return item
This works, but I think it would be cleaner to yield the result directly rather than attach it to an item just to get it to a pipeline, like:
def get_pdf_text(self, response):
    in_memory_pdf = BytesIO(bytes(response.body))
    in_memory_pdf.seek(0)
    yield slate.PDF(in_memory_pdf)
Is this possible?
According to the Scrapy documentation, a spider callback has to return Request instances, dictionaries, or Item instances:
This method, as well as any other Request callback, must return an
iterable of Request and/or dicts or Item objects.
So, if you don't want to define a special "item" for the pdf content, simply wrap it into a dict:
def get_pdf_text(self, response):
    in_memory_pdf = BytesIO(bytes(response.body))
    in_memory_pdf.seek(0)
    doc = slate.PDF(in_memory_pdf)
    return {'pdf_text': doc}