I'm working with Scrapy. In my current project I am capturing the text from PDF files, and I want to send it to a pipeline for parsing. Right now I have:
def get_pdf_text(self, response):
    in_memory_pdf = BytesIO(bytes(response.body))
    in_memory_pdf.seek(0)
    doc = slate.PDF(in_memory_pdf)
    item = OveItem()
    item['pdf_text'] = doc
    return item
pipelines.py
class OvePipeline(object):
    def process_item(self, item, spider):
        # ...
        return item
This works, but I think it would be cleaner to just yield the result directly, without having to attach it to an item to get it to a pipeline, like:
def get_pdf_text(self, response):
    in_memory_pdf = BytesIO(bytes(response.body))
    in_memory_pdf.seek(0)
    yield slate.PDF(in_memory_pdf)
Is this possible?
According to the Scrapy documentation, a spider callback has to return Request instances, dicts, or Item instances:
This method, as well as any other Request callback, must return an
iterable of Request and/or dicts or Item objects.
So, if you don't want to define a special "item" for the pdf content, simply wrap it into a dict:
def get_pdf_text(self, response):
    in_memory_pdf = BytesIO(bytes(response.body))
    in_memory_pdf.seek(0)
    doc = slate.PDF(in_memory_pdf)
    return {'pdf_text': doc}
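The dict then flows through your pipeline just like an item would; as a rough sketch (reusing the OvePipeline from the question), process_item simply receives the dict:
class OvePipeline(object):
    def process_item(self, item, spider):
        # item is the plain dict yielded by the spider, e.g. {'pdf_text': ...}
        pdf_text = item.get('pdf_text')
        # ... parse pdf_text here ...
        return item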
Related
I'm using Scrapy to crawl a website. The first call seems OK and collects some data. For every subsequent request I need some information from another request. To simplify the code, I separated the different requests into different method calls. But it seems that Scrapy does not execute these method calls; none of the sub-calls are run.
I already tried a few different things:
Called an instance method with self.sendQueryHash(response, tagName, afterHash).
Called a static method with sendQueryHash(response, tagName, afterHash) and changed the indentation.
Removed the method call, and it worked; I saw the sendQueryHash output in the logger.
import scrapy
import re
import json
import time
import logging


class TestpostSpider(scrapy.Spider):
    name = 'testPost'
    allowed_domains = ['test.com']
    tags = [
        "this",
        "that"]

    def start_requests(self):
        requests = []
        for i, value in enumerate(self.tags):
            url = "https://www.test.com/{}/".format(value)
            requests.append(scrapy.Request(
                url,
                meta={'cookiejar': i},
                callback=self.parsefirstAccess))
        return requests

    def parsefirstAccess(self, response):
        self.logger.info("parsefirstAccess")
        jsonData = response.text

        # That call works fine
        tagName, hasNext, afterHash = self.extractFirstNextPageData(jsonData)

        yield {
            'json': jsonData,
            'requestTime': int(round(time.time() * 1000)),
            'requestNumber': 0
        }

        if not hasNext:
            self.logger.info("hasNext is false")
            # No more data available, stop processing
            return
        else:
            self.logger.info("hasNext is true")
            # Send request to get the query hash of the current tag
            self.sendQueryHash(response, tagName, afterHash)  # Problem occurs here

    def sendQueryHash(self, response, tagName, afterHash):
        self.logger.info("sendQueryHash")
        request = scrapy.Request(
            "https://www.test.com/static/bundles/es6/TagPageContainer.js/21d3cb18e725.js",
            meta={'cookiejar': response.meta['cookiejar']},
            callback=self.parseQueryHash,
            dont_filter=True)
        request.cb_kwargs['tagName'] = tagName
        request.cb_kwargs['afterHash'] = afterHash
        yield request

    def extractFirstNextPageData(self, json):
        return "data1", True, "data3"
I expect the sendQueryHash output to be shown, but it never happens. It only works when I comment out the self.sendQueryHash call and the def sendQueryHash method.
That's only one example of the behavior I didn't expect.
self.sendQueryHash(response, tagName, afterHash)  # Problem occurs here
will just create a generator that you do nothing with. You need to make sure you yield your Request back to the Scrapy engine. Since only a single request is returned, you should be able to use return instead of yield inside sendQueryHash and then yield the Request directly by replacing the above line with
yield self.sendQueryHash(response, tagName, afterHash)
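For completeness, here is a sketch of sendQueryHash adjusted as described, i.e. the question's own method with the final yield swapped for a return:
def sendQueryHash(self, response, tagName, afterHash):
    self.logger.info("sendQueryHash")
    request = scrapy.Request(
        "https://www.test.com/static/bundles/es6/TagPageContainer.js/21d3cb18e725.js",
        meta={'cookiejar': response.meta['cookiejar']},
        callback=self.parseQueryHash,
        dont_filter=True)
    request.cb_kwargs['tagName'] = tagName
    request.cb_kwargs['afterHash'] = afterHash
    return request  # return instead of yield, so parsefirstAccess can yield it
Alternatively, keep the yield inside sendQueryHash and write yield from self.sendQueryHash(response, tagName, afterHash) in parsefirstAccess; either way the Request ends up back with the engine.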
I have the following Python script using Scrapy:
import scrapy


class ChemSpider(scrapy.Spider):
    name = "site"

    def start_requests(self):
        urls = [
            'https://www.site.com.au'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        category_links = response.css('li').xpath('a/@href').getall()
        category_links_filtered = [x for x in category_links if 'shop-online' in x]  # remove non category links
        category_links_filtered = list(dict.fromkeys(category_links_filtered))  # remove duplicates

        for category_link in category_links_filtered:
            if "medicines" in category_link:
                next_page = response.urljoin(category_link) + '?size=10'
                self.log(next_page)
                yield scrapy.Request(next_page, callback=self.parse_subcategories)

    def parse_subcategories(self, response):
        for product in response.css('div.Product'):
            yield {
                'category_link': response.url,
                'product_name': product.css('img::attr(alt)').get(),
                'product_price': product.css('span.Price::text').get().replace('\n', '')
            }
My solution will run multiple instances of this script, each scraping a different subset of information from different 'categories'. I know you can run Scrapy from the command line and output to a JSON file, but I want to do the output to a file from within the function, so each instance writes to a different file. Being a beginner with Python, I'm not sure where to go with my script. I need to get the output of the yield into a file while the script is executing. How do I achieve this? There will be hundreds of rows scraped, and I'm not familiar enough with how yield works to understand how to 'return' from it a set of data (or a list) that can then be written to the file.
You are looking to append to a file. But since file writing is an I/O operation, you need to lock the file against being written to by other processes while one process is writing.
The easiest way to achieve this is to write to different randomly named files in a directory and concatenate them all afterwards using another process.
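As a rough illustration of that idea (none of this is from the original answer; the directory and file names are made up), each instance would write to its own uniquely named file and a separate step would merge them afterwards:
import glob
import os
import uuid

os.makedirs("output", exist_ok=True)

# each spider instance appends its rows to its own uniquely named file
part_path = os.path.join("output", "part-{}.jl".format(uuid.uuid4().hex))

# later, a separate process concatenates all the parts into a single file
with open(os.path.join("output", "merged.jl"), "w") as merged:
    for part in sorted(glob.glob(os.path.join("output", "part-*.jl"))):
        with open(part) as f:
            merged.write(f.read())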
First let me suggest some changes to your code. If you want to remove duplicates, you could use a set like this:
category_links_filtered = (x for x in category_links if 'shop-online' in x) # remove non category links
category_links_filtered = set(category_links_filtered) # remove duplicates
Note that I'm also changing the [ to ( to make a generator instead of a list and save some memory. You can read more about generators here: https://www.python-course.eu/python3_generators.php
OK, then the solution to your problem is to use an Item Pipeline (https://docs.scrapy.org/en/latest/topics/item-pipeline.html), which performs some action on every item yielded from your parse_subcategories function. What you do is add a class to your pipelines.py file and enable the pipeline in settings.py. That is:
In settings.py:
ITEM_PIPELINES = {
    'YOURBOTNAME.pipelines.CategoriesPipeline': 300,  # the number here is the priority of the pipeline; don't worry and just leave it
}
In pipelines.py:
import json
from urllib.parse import urlparse  # library to parse urls (it was `from urlparse import urlparse` on Python 2)


class CategoriesPipeline(object):
    # This class dynamically picks the output file name from the category name in the url or from an attribute

    def open_spider(self, spider):
        if hasattr(spider, 'filename'):
            # the filename is an attribute set by -a filename=somefilename
            filename = spider.filename
        else:
            # you could also set the name dynamically from the start url,
            # if you set -a start_url=https://www.site.com.au/category-name
            try:
                filename = urlparse(spider.start_url).path[1:]  # this returns 'category-name'
            except AttributeError:
                spider.crawler.engine.close_spider(spider, reason='no start url')  # this should not happen
        self.file = open(filename + '.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
In spiders/YOURBOTNAME.py modify this:
class ChemSpider(scrapy.Spider):
    name = "site"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        if not hasattr(self, 'start_url'):
            # we need a start url passed with -a start_url=... (the crawler is not attached yet here, so raise rather than call engine.close_spider)
            raise ValueError('no start url')
        self.start_urls = [self.start_url]  # see why this works on https://docs.scrapy.org/en/latest/intro/tutorial.html#a-shortcut-for-creating-requests

    def parse(self, response):
        ...  # same parse method as before
and then you start your crawl with this command: scrapy crawl site -a start_url=https://www.site.com.au/category-name and you could optionally add -a filename=somename
I'm trying to download a file using a custom Scrapy pipeline. However, the file url is not trivial to obtain. Here are the steps:
the pipeline gets an item containing a pdfLink attribute
the page at pdfLink is a wrapper around the pdf, which is embedded in an iframe
I then extend the FilesPipeline class:
import scrapy
from scrapy.pipelines.files import FilesPipeline


class PdfPipeline(FilesPipeline):
    def get_media_requests(self, item, spider):
        yield scrapy.Request(item['pdfLink'],
                             callback=self.get_pdfurl)

    def get_pdfurl(self, response):
        import logging
        logging.info('...............')
        print(response.url)
        yield scrapy.Request(response.css('iframe::attr(src)').extract()[0])
However:
the files that are downloaded are the web pages pointed to by pdfLink, not the embedded pdf file.
neither the print nor the logging.info output shows up in the logs.
It thus seems that get_pdfurl is never called back. Am I doing something wrong? How is it possible to download such a nested file?
Found a solution by using two consecutive pipelines, where the first is built like in Item pipeline - Take screenshot of item.
import scrapy
from scrapy.pipelines.files import FilesPipeline


class PdfWrapperPipeline(object):
    def process_item(self, item, spider):
        request = scrapy.Request(item.get('pdfLink'))
        dfd = spider.crawler.engine.download(request, spider)
        dfd.addBoth(self.return_item, item)
        return dfd

    def return_item(self, response, item):
        if response.status != 200:
            # Error happened, return item.
            return item
        url = response.css('iframe::attr(src)').extract()[0]
        item['pdfUrl'] = url
        return item


class PdfPipeline(FilesPipeline):
    def get_media_requests(self, item, spider):
        yield scrapy.Request(item.get('pdfUrl'))
and then set in settings.py the wrapper pipeline priority higher than the pdf pipeline priority.
ITEM_PIPELINES = {
    'project.pipelines.PdfWrapperPipeline': 1,
    'project.pipelines.PdfPipeline': 2,
}
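One extra note (not part of the original answer): FilesPipeline subclasses only download and store files when a storage location is configured, so settings.py also needs something along the lines of:
FILES_STORE = '/path/to/store/downloaded/pdfs'  # any valid filesystem (or S3/GCS) path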
This response was first posted on Scrapy's GitHub.
I have an item pipeline to process prices. I am getting errors while processing items in this pipeline, but the Scrapy error doesn't tell me which url produced the error. Is there any way I can access the request object inside the pipeline?
def process_item(self, item, spider):
    """
    :param self:
    :param item:
    :param spider:
    """
    print(dir(spider))  # No request object here...
    quit()

    if not all(item['price']):
        raise DropItem

    item['price']['new'] = float(re.sub(
        r"\D", "", item['price']['new']))
    item['price']['old'] = float(re.sub(
        r"\D", "", item['price']['old']))

    try:
        item['price']['discount'] = math.ceil(
            100 - (100 * (item['price']['new'] /
                          item['price']['old'])))
    except ZeroDivisionError as e:
        # here I want to see the culprit url...
        print("Error in calculating discount {item} {request}".format(item=item, request=spider.request))
        raise DropItem
    return item
You can't from an ItemPipeline. You would be able to access the response (and response.url) from a spider middleware, but I think the easier solution is to add a temporary url field assigned when you yield the item, something like:
yield {
    # ...
    'url': response.url,
    # ...
}
Then the url can be easily accessed inside the pipeline.
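For example, here is a minimal sketch of how the pipeline could then report the culprit url (the class name is made up; the discount logic mirrors the question's code and assumes the temporary 'url' field added above):
import math

from scrapy.exceptions import DropItem


class PricePipeline(object):
    def process_item(self, item, spider):
        try:
            item['price']['discount'] = math.ceil(
                100 - (100 * (item['price']['new'] / item['price']['old'])))
        except ZeroDivisionError:
            # the temporary field tells us which page produced the bad prices
            spider.logger.error("Error calculating discount for %s", item.get('url'))
            raise DropItem("zero old price for {}".format(item.get('url')))
        return item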
In your spider class, whatever class variables you define can be accessed within your pipeline via spider.variable_name.
class MySpider(scrapy.Spider):
    name = "walmart"
    my_var = "TEST"
    my_dict = {'test': "test_val"}
Now in your pipeline you can do spider.name, spider.my_var, spider.my_dict.
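A small sketch of the pipeline side (the class name and log message are only illustrative):
class DebugPipeline(object):
    def process_item(self, item, spider):
        # any class variable defined on the spider is reachable through the spider argument
        spider.logger.info("name=%s my_var=%s my_dict=%s",
                           spider.name, spider.my_var, spider.my_dict)
        return item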
I have made a simple Scrapy spider that I use from the command line to export my data into the CSV format, but the order of the data seems random. How can I order the CSV fields in my output?
I use the following command line to get CSV data:
scrapy crawl somwehere -o items.csv -t csv
According to this Scrapy documentation, I should be able to use the fields_to_export attribute of the BaseItemExporter class to control the order. But I am clueless about how to use this, as I have not found any simple example to follow.
Please note: this question is very similar to THIS one. However, that question is over 2 years old, doesn't address the many recent changes to Scrapy, and doesn't provide a satisfactory answer, as it requires hacking one or both of:
contrib/exporter/__init__.py
contrib/feedexport.py
to address some previous issues that seem to have already been resolved...
Many thanks in advance.
To use such an exporter you need to create your own Item pipeline that will process your spider output. Assuming that you have a simple case and you want to have all spider output in one file, this is the pipeline you should use (pipelines.py):
from scrapy import signals
from scrapy.exporters import CsvItemExporter  # was scrapy.contrib.exporter in old Scrapy versions


class CSVPipeline(object):

    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        file = open('%s_items.csv' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = CsvItemExporter(file)
        self.exporter.fields_to_export = [...]  # list with names of fields to export - order is important
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
Of course you need to remember to add this pipeline to your configuration file (settings.py):
ITEM_PIPELINES = {'myproject.pipelines.CSVPipeline': 300 }
You can now specify settings in the spider itself.
https://doc.scrapy.org/en/latest/topics/settings.html#settings-per-spider
To set the field order for exported feeds, set FEED_EXPORT_FIELDS.
https://doc.scrapy.org/en/latest/topics/feed-exports.html#feed-export-fields
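For example, a minimal project-wide variant would be (the field names here are just placeholders):
# in settings.py
FEED_EXPORT_FIELDS = ["id", "name", "price"]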
The spider below dumps all links on a website (written against Scrapy 1.4.0):
import scrapy
from scrapy.http import HtmlResponse


class DumplinksSpider(scrapy.Spider):
    name = 'dumplinks'
    allowed_domains = ['www.example.com']
    start_urls = ['http://www.example.com/']
    custom_settings = {
        # specifies exported fields and order
        'FEED_EXPORT_FIELDS': ["page", "page_ix", "text", "url"],
    }

    def parse(self, response):
        if not isinstance(response, HtmlResponse):
            return
        a_selectors = response.xpath('//a')
        for i, a_selector in enumerate(a_selectors):
            text = a_selector.xpath('normalize-space(text())').extract_first()
            url = a_selector.xpath('@href').extract_first()
            yield {
                'page_ix': i + 1,
                'page': response.url,
                'text': text,
                'url': url,
            }
            yield response.follow(url, callback=self.parse)  # see allowed_domains
Run with this command:
scrapy crawl dumplinks --loglevel=INFO -o links.csv
Fields in links.csv are ordered as specified by FEED_EXPORT_FIELDS.
I found a pretty simple way to solve this issue. The answers above are still more correct, I would say, but this is a quick fix. It turns out Scrapy pulls the items in alphabetical order, and capitalization matters: an item field beginning with 'A' will be pulled first, then 'B', 'C', etc., followed by 'a', 'b', 'c'. I have a project going right now where the header names are not extremely important, but I did need the UPC to be the first header for input into another program. I have the following item class:
class ItemInfo(scrapy.Item):
    item = scrapy.Field()
    price = scrapy.Field()
    A_UPC = scrapy.Field()
    ID = scrapy.Field()
    time = scrapy.Field()
My CSV file outputs with the headers (in order): A_UPC, ID, item, price, time