Is it possible to access the name of the current spider in a feed exporter?
The doc about storage URI parameters might help.
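Concretely, the built-in feed exports support a %(name)s placeholder in the storage URI that Scrapy expands to the spider's name (and %(time)s for a timestamp), so a settings fragment like the following (the output path is illustrative) gives each spider its own file:

```python
# settings.py -- %(name)s expands to the spider's name and %(time)s to a
# timestamp when the feed is stored; the 'exports/' path is just an example.
FEED_URI = 'exports/%(name)s_%(time)s.json'
FEED_FORMAT = 'json'
```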
Or, if you are building your own:
The methods used by exporters receive the spider object as an argument.
For example:
def open_spider(self, spider):
    print(spider.name)

def close_spider(self, spider):
    print(spider.name)

def item_scraped(self, item, spider):
    print(spider.name)
I have a scrapy process which successfully parses items and sub-items, but I can't see whether there's a final hook which would allow me to transform the final data result after everything has been parsed, but before it is formatted as output.
My spider is doing something like this:
class MySpider(scrapy.Spider):
    def parse(self, response, **kwargs):
        for part in [1, 2, 3]:
            url = f'{response.request.url}?part={part}'
            yield scrapy.Request(url=url, callback=self.parse_part, meta={'part': part})

    def parse_part(self, response, **kwargs):
        # ...
        for subpart in part:
            yield {
                'title': self.get_title(subpart),
                'tag': self.get_tag(subpart),
            }
This works well, but I haven't been able to figure out where I can take the complete resulting structure and transform it before outputting it to json (or whatever). I thought maybe I could do this in the process_spider_output call of Middleware, but this only seems to give me the single items, not the final structure.
You can use this method to do something after the spider has closed:
def spider_closed(self):
However, you won't be able to modify items in that method. To modify items you need to write a custom pipeline. In the pipeline you write a method which gets called every time your spider yields an item, so in that method you can save all items to a list and then transform the whole list in the pipeline's close_spider method.
Read here on how to write your own pipeline
Example:
Let's say you want to have all your items as JSON, maybe to send a request to an API. You have to activate your pipeline in settings.py for it to be used.
import json

class MyPipeline:
    def __init__(self, *args, **kwargs):
        self.items = []

    def process_item(self, item, spider):
        self.items.append(item)
        return item

    def close_spider(self, spider):
        # In this method you can iterate over self.items and transform them to your preference.
        json_data = json.dumps(self.items)
        print(json_data)
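As mentioned, nothing runs until the pipeline is activated in settings.py; a minimal fragment (the dotted path assumes a project named myproject and must match where MyPipeline actually lives):

```python
# settings.py -- 'myproject.pipelines.MyPipeline' is an assumed example path.
# The value (0-1000) sets the order in which pipelines run, lower numbers first.
ITEM_PIPELINES = {
    'myproject.pipelines.MyPipeline': 300,
}
```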
I wrote a custom pipeline to get the node names that I wanted:
from scrapy import signals
from scrapy.exporters import XmlItemExporter

class XmlExportPipeline(object):
    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        file = open('crawl.xml', 'wb')  # the exporter writes bytes, so open in binary mode
        self.files[spider] = file
        self.exporter = XmlItemExporter(file, item_element='job', root_element='jobs', indent=1)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()
        self.uploadftp(spider)

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
Now I can't figure out how to export with FTP instead of just local storage.
To change item data, pipelines are great. And there are indeed export use cases where they also make sense (e.g. splitting items across multiple files).
To change the output format, however, it may be better to implement a custom feed exporter, register it in FEED_EXPORTERS and enable it in FEED_FORMAT.
There’s no extensive documentation about creating custom feed exporters, but if you have a look at the implementation of XmlItemExporter you should be able to figure things out.
In fact, looking at your code and XmlItemExporter’s you may simply need to subclass XmlItemExporter, change its __init__ method to pass item_element='job', root_element='jobs' to the parent __init__, and use the FEED_EXPORT_INDENT setting to define the desired indentation (1).
I have multiple spiders within one scraping program, and I am trying to run all spiders simultaneously from a script and then dump the contents to a JSON file. When I use the shell on each individual spider and do -o xyz.json it works fine.
I've attempted to follow this fairly thorough answer here:
How to create custom Scrapy Item Exporter?
but when I run the file I can see it gather the data in the shell but it does not output it at all.
Below I've copied in order:
Exporter,
Pipeline,
Settings,
Exporter:
from scrapy.exporters import JsonItemExporter

class XYZExport(JsonItemExporter):
    def __init__(self, file, **kwargs):
        super().__init__(file)

    def start_exporting(self):
        self.file.write(b)

    def finish_exporting(self):
        self.file.write(b)
I'm struggling to determine what goes in the self.file.write parentheses?
Pipeline:
from exporters import XYZExport

class XYZExport(object):
    def __init__(self, file_name):
        self.file_name = file_name
        self.file_handle = None

    @classmethod
    def from_crawler(cls, crawler):
        output_file_name = crawler.settings.get('FILE_NAME')
        return cls(output_file_name)

    def open_spider(self, spider):
        print('Custom export opened')
        file = open(self.file_name, 'wb')
        self.file_handle = file
        self.exporter = XYZExport(file)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        print('Custom Exporter closed')
        self.exporter.finish_exporting()
        self.file_handle.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
Settings:
FILE_NAME = 'C:\Apps Ive Built\WebScrape Python\XYZ\ScrapeOutput.json'
ITEM_PIPELINES = {
    'XYZ.pipelines.XYZExport': 600,
}
I hope/am afraid it's a simple omission, because that seems to be my MO, but I'm very new to scraping and this is the first time I've tried to do it this way.
If there is a more stable way to export this data I'm all ears; otherwise, can you tell me what I've missed that is preventing the data from being exported, or preventing the exporter from being properly called?
[Edited to change the pipeline name in settings]
Unfortunately I don't have enough reputation to make a comment, so I have to ask this as a new question, referring to https://stackoverflow.com/questions/23105590/how-to-get-the-pipeline-object-in-scrapy-spider
I have many urls in a DB, and I want to get the start_urls from my db. So far, not a big problem.
However, I don't want the MySQL code inside the spider, and with the pipeline approach I run into a problem.
If I try to hand the pipeline object over to my spider as in the referenced question, I only get an AttributeError with the message
'NoneType' object has no attribute 'getUrl'
I think the actual problem is that the function spider_opened doesn't get called (I also inserted a print statement which never showed its output in the console).
Does somebody have an idea how to get the pipeline object inside the spider?
MySpider.py
def __init__(self):
    self.pipe = None

def start_requests(self):
    url = self.pipe.getUrl()
    scrapy.Request(url, callback=self.parse)
Pipeline.py
@classmethod
def from_crawler(cls, crawler):
    pipeline = cls()
    crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)

def spider_opened(self, spider):
    spider.pipe = self

def getUrl(self):
    ...
Scrapy pipelines already provide the expected methods open_spider and close_spider.
Taken from docs: https://doc.scrapy.org/en/latest/topics/item-pipeline.html#open_spider
open_spider(self, spider)
This method is called when the spider is opened.
Parameters: spider (Spider object) – the spider which was opened
close_spider(self, spider)
This method is called when the spider is closed.
Parameters: spider (Spider object) – the spider which was closed
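Put together, a pipeline built on those two hooks needs no signal wiring at all, since Scrapy calls the methods by name; a minimal sketch (the class name is made up):

```python
class CollectingPipeline:
    """Collects every yielded item into a list."""

    def open_spider(self, spider):
        # Called once when the spider is opened.
        self.items = []

    def process_item(self, item, spider):
        # Called for every item the spider yields.
        self.items.append(item)
        return item

    def close_spider(self, spider):
        # Called once when the spider is closed; self.items holds everything.
        print(f'collected {len(self.items)} items')

# Simulating the calls Scrapy would make:
pipeline = CollectingPipeline()
pipeline.open_spider(spider=None)
pipeline.process_item({'title': 'a'}, spider=None)
pipeline.process_item({'title': 'b'}, spider=None)
pipeline.close_spider(spider=None)
```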
However, your original issue doesn't make much sense: why do you want to assign a pipeline reference to your spider? That seems like a very bad idea.
What you should do is open up db and read urls in your spider itself.
from scrapy import Spider

class MySpider(Spider):
    name = 'myspider'
    start_urls = []

    @classmethod
    def from_crawler(self, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        spider.start_urls = self.get_urls_from_db()
        return spider

    def get_urls_from_db(self):
        db = ...  # get db cursor here
        urls = ...  # use cursor to pop your urls
        return urls
I'm using the accepted solution but it doesn't work as expected.
TypeError: get_urls_from_db() missing 1 required positional argument: 'self'
Here's the version that worked for me:
import os

from scrapy import Spider

class MySpider(Spider):
    name = 'myspider'
    start_urls = []

    def __init__(self, db_dsn):
        self.db_dsn = db_dsn
        self.start_urls = self.get_urls_from_db(db_dsn)

    @classmethod
    def from_crawler(cls, crawler):
        spider = cls(
            db_dsn=os.getenv('DB_DSN', 'mongodb://localhost:27017'),
        )
        spider._set_crawler(crawler)
        return spider

    def get_urls_from_db(self, db_dsn):
        db = ...  # get db cursor here
        urls = ...  # use cursor to pop your urls
        return urls
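For the elided get_urls_from_db body, here is a self-contained sketch with sqlite3 standing in for the real backend (the urls table and url column are assumed names; the post above uses a Mongo DSN instead):

```python
import os
import sqlite3
import tempfile

def get_urls_from_db(db_dsn):
    # sqlite3 stands in for the real database; 'urls'/'url' are assumed names.
    conn = sqlite3.connect(db_dsn)
    try:
        rows = conn.execute('SELECT url FROM urls ORDER BY url').fetchall()
        return [row[0] for row in rows]
    finally:
        conn.close()

# Demo: seed a throwaway database file, then read the urls back.
path = os.path.join(tempfile.mkdtemp(), 'urls.db')
conn = sqlite3.connect(path)
conn.execute('CREATE TABLE urls (url TEXT)')
conn.executemany('INSERT INTO urls VALUES (?)',
                 [('http://example.com/1',), ('http://example.com/2',)])
conn.commit()
conn.close()

start_urls = get_urls_from_db(path)
```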
I am using MongoDB to store the crawled data.
Now I want to query the last date of the data, so that I can continue crawling without restarting from the beginning of the url list (the url can be determined by the date, like: /2014-03-22.html).
I want only one connection object handling the database operations, and it lives in the pipeline.
So, I want to know how I can get the pipeline object (not a new one) in the spider.
Or any better solution for incremental updates...
Thanks in advance.
Sorry for my poor English...
Here is a sample:
# This is my Pipeline
class MongoDBPipeline(object):
    def __init__(self, mongodb_db=None, mongodb_collection=None):
        self.connection = pymongo.Connection(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        ....

    def process_item(self, item, spider):
        ....

    def get_date(self):
        ....
And the spider:
class Spider(Spider):
    name = "test"
    ....

    def parse(self, response):
        # Want to get the Pipeline object
        mongo = MongoDBPipeline()  # done this way, it must be a new Pipeline object
        mongo.get_date()  # in scrapy, there is already a Pipeline object for the spider
        # I want to get the Pipeline object which was created when scrapy started.
OK, I just don't want to create a new object... I admit I am a bit OCD.
A Scrapy pipeline has an open_spider method that gets executed after the spider is initialized. You can pass a reference to the database connection, the get_date() method, or the pipeline itself to your spider. An example of the latter with your code is:
# This is my Pipeline
class MongoDBPipeline(object):
    def __init__(self, mongodb_db=None, mongodb_collection=None):
        self.connection = pymongo.Connection(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        ....

    def process_item(self, item, spider):
        ....

    def get_date(self):
        ....

    def open_spider(self, spider):
        spider.myPipeline = self
Then, in the spider:
class Spider(Spider):
    name = "test"

    def __init__(self):
        self.myPipeline = None

    def parse(self, response):
        self.myPipeline.get_date()
I don't think the __init__() method is necessary here, but I put it here to show that open_spider replaces it after initialization.
According to the scrapy Architecture Overview:
The Item Pipeline is responsible for processing the items once they have been extracted (or scraped) by the spiders.
Basically that means that the spiders run first, and the extracted items then go to the pipelines - there is no way to go backwards.
One possible solution would be, in the pipeline itself, check if the Item you've scraped is already in the database.
Another workaround would be to keep the list of urls you've crawled in the database, and, in the spider, check if you've already got the data from a url.
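For date-based urls like the /2014-03-22.html pattern above, that second workaround reduces to generating only the dates you have not crawled yet; a sketch (the base url is made up, and last_stored would come from the pipeline's get_date() query):

```python
from datetime import date, timedelta

def missing_urls(last_stored, today, base='http://example.com'):
    # Yield one /YYYY-MM-DD.html url per day after last_stored, up to today.
    day = last_stored + timedelta(days=1)
    while day <= today:
        yield f'{base}/{day.isoformat()}.html'
        day += timedelta(days=1)

urls = list(missing_urls(date(2014, 3, 22), date(2014, 3, 25)))
```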
Since I'm not sure what you mean by "start from the beginning", I cannot suggest anything specific.
I hope this information helps at least.