I wrote a custom pipeline to get the node names that I wanted:
from scrapy import signals
from scrapy.exporters import XmlItemExporter

class XmlExportPipeline(object):
    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        file = open('crawl.xml', 'w', encoding='utf-8')
        self.files[spider] = file
        self.exporter = XmlItemExporter(file, item_element='job', root_element='jobs', indent=1)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()
        self.uploadftp(spider)

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
Now I can't figure out how to export with FTP instead of just local storage.
To change item data, pipelines are great. And there are indeed export use cases where they also make sense (e.g. splitting items across multiple files).
To change the output format, however, it may be better to implement a custom feed exporter, register it in FEED_EXPORTERS and enable it in FEED_FORMAT.
There’s no extensive documentation about creating custom feed exporters, but if you have a look at the implementation of XmlItemExporter you should be able to figure things out.
In fact, looking at your code and XmlItemExporter’s, you may simply need to subclass XmlItemExporter, override its __init__ method to pass item_element='job' and root_element='jobs' to the parent __init__, and use the FEED_EXPORT_INDENT setting to define the desired indentation (1).
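A minimal sketch of that approach (the JobXmlItemExporter class, the 'jobxml' format key and the myproject.exporters module path are my own illustrative names, not anything Scrapy provides):

from scrapy.exporters import XmlItemExporter

class JobXmlItemExporter(XmlItemExporter):
    def __init__(self, file, **kwargs):
        # Force the element names; everything else (encoding, indent, ...) passes through.
        kwargs['item_element'] = 'job'
        kwargs['root_element'] = 'jobs'
        super().__init__(file, **kwargs)

# settings.py
FEED_EXPORTERS = {'jobxml': 'myproject.exporters.JobXmlItemExporter'}
FEED_FORMAT = 'jobxml'
FEED_EXPORT_INDENT = 1

With the format handled by the feed export machinery, the feed URI (FEED_URI) can then point at an ftp:// location instead of a local file, since Scrapy's feed storage supports FTP, rather than writing locally and uploading by hand.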
Related
I have a scrapy process which successfully parses items and sub-items, but I can't see whether there's a final hook which would allow me to transform the final data result after everything has been parsed, but before it is formatted as output.
My spider is doing something like this:
class MySpider(scrapy.Spider):
    def parse(self, response, **kwargs):
        for part in [1, 2, 3]:
            url = f'{response.request.url}?part={part}'
            yield scrapy.Request(url=url, callback=self.parse_part, meta={'part': part})

    def parse_part(self, response, **kwargs):
        # ...
        for subpart in part:
            yield {
                'title': self.get_title(subpart),
                'tag': self.get_tag(subpart),
            }
This works well, but I haven't been able to figure out where I can take the complete resulting structure and transform it before outputting it to json (or whatever). I thought maybe I could do this in the process_spider_output call of Middleware, but this only seems to give me the single items, not the final structure.
You can use this method to do something after the spider has closed:
def spider_closed(self):
However, you won't be able to modify items in that method. To modify items you need to write a custom pipeline. In the pipeline you write a method that gets called every time your spider yields an item, so in that method you can save all items to a list and then transform the whole list in the pipeline's close_spider method.
Read here on how to write your own pipeline
Example:
Let's say you want to have all you items as JSON to maybe send a request to an API. You have to activate your pipeline in settings.py for it to be used.
import json

class MyPipeline:
    def __init__(self, *args, **kwargs):
        self.items = []

    def process_item(self, item, spider):
        self.items.append(item)
        return item

    def close_spider(self, spider):
        # In this method you can iterate over self.items and transform them to your preference.
        json_data = json.dumps(self.items)
        print(json_data)
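Activating the pipeline, as mentioned above, could look like this (the myproject.pipelines.MyPipeline path is an assumption about your project layout):

# settings.py -- module path is hypothetical, adjust to your project
ITEM_PIPELINES = {
    'myproject.pipelines.MyPipeline': 300,
}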
I am writing a class representing a file. This class has some optional features: normally files are stored in memory, but sometimes they need to be stored on disk, sometimes I want to store them as zip files, and so on. I decided to use mixins, so that I can subclass the File class and add whichever mixins I actually need in a given case. In such a setup, reading/writing a file is an operation that requires some preparation and some cleanup (e.g. I need to unzip the file, perform some write, and then zip the updated version again). For this purpose I wanted to use custom context managers, to ensure these actions are performed even if there's an exception or a return statement in the middle of the with statement. Here's my code:
class File(object):
    def read(self):
        return "file content"

class ZipMixin(object):
    def read(self):
        with self:
            return super(ZipMixin, self).read()

    def __enter__(self):
        print("Unzipping")
        return self

    def __exit__(self, *args):
        print("Zipping back")

class SaveMixin(object):
    def read(self):
        with self:
            return super(SaveMixin, self).read()

    def __enter__(self):
        print("Loading to memory")
        return self

    def __exit__(self, *args):
        print("Removing from memory, saving on disk")

class SaveZipFile(SaveMixin, ZipMixin, File):
    pass

f = SaveZipFile()
print(f.read())
However, the output is quite disappointing:
Loading to memory
Loading to memory
Removing from memory, saving on disk
Removing from memory, saving on disk
file content
while it should be:
Loading to memory from disk
Unzipping
Zipping back
Removing from memory, saving on disk
file content
Apparently, the calls to super in mixins with context managers are not passed "in chain" through all mixins, but rather go twice through the first mixin and then directly to the superclass (skipping the intermediate mixin). I tested it with both Python 2 and 3, same result. What is wrong?
What happens?
The "super" call works as you expect it to work, the read methods of both of your mixins are called in the expected order?
However, you use with self: in both of your SaveMixin and ZipMixin classes read methods.
self is the same in both cases, resulting in the same __enter__ and __exit__ methods beeing used, regardless the declaring class.
According to the method resolution order of the SaveZipFile class, the methods of the SaveMixin class are used:
>>> SaveZipFile.__mro__
(<class '__main__.SaveZipFile'>, <class '__main__.SaveMixin'>, <class '__main__.ZipMixin'>, <class '__main__.File'>, <class 'object'>)
In short, the read methods of your SaveMixin and ZipMixin classes are called in the correct order, but with self: uses the __enter__ and __exit__ methods of the SaveMixin class both times.
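A quick (hypothetical) way to see this for yourself is to print the concrete type of self inside one of the mixin read methods:

class ZipMixin(object):
    def read(self):
        print(type(self).__name__)  # prints 'SaveZipFile', so self.__enter__ resolves via SaveZipFile's MRO
        with self:
            return super(ZipMixin, self).read()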
How can this be resolved?
It seems like the with statement is not a great fit for use with mixins here, but a possible solution is using the Decorator pattern:
class File(object):
    def read(self):
        return "file content"

class ZipDecorator(object):
    def __init__(self, inner):
        self.inner = inner

    def read(self):
        with self:
            return self.inner.read()

    def __enter__(self):
        print("Unzipping")
        return self

    def __exit__(self, *args):
        print("Zipping back")

class SaveDecorator(object):
    def __init__(self, inner):
        self.inner = inner

    def read(self):
        with self:
            return self.inner.read()

    def __enter__(self):
        print("Loading to memory")
        return self

    def __exit__(self, *args):
        print("Removing from memory, saving on disk")

class SaveZipFile(object):
    def read(self):
        decorated_file = SaveDecorator(
            ZipDecorator(
                File()
            )
        )
        return decorated_file.read()

f = SaveZipFile()
print(f.read())
Output:
Loading to memory
Unzipping
Zipping back
Removing from memory, saving on disk
file content
The self that you're passing around is of type SaveZipFile. If you look at the MRO (method resolution order) of SaveZipFile, it's something like this:
            object
          /   |    \
  SaveMixin ZipMixin File
          \   |    /
         SaveZipFile
When you call with self:, it ends up calling self.__enter__(). And since self is of type SaveZipFile, Python looks at the MRO paths for that class (going "up" the graph, searching the paths left to right) and finds a match on the first path, in SaveMixin.
If you're going to offer the zip and save functionality as mixins, you're probably better off using the try/finally pattern and letting super determine which class's method should be called and in what order:
class File(object):
    def read(self):
        return "file content"

class ZipMixin(object):
    def read(self):
        try:
            print("Unzipping")
            return super(ZipMixin, self).read()
        finally:
            print("Zipping back")

class SaveMixin(object):
    def read(self):
        try:
            print("Loading to memory")
            return super(SaveMixin, self).read()
        finally:
            print("Removing from memory, saving on disk")

class SaveZipFile(SaveMixin, ZipMixin, File):
    pass
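With this version the cooperative super calls chain through the MRO, so a quick check gives the ordering you were after:

f = SaveZipFile()
print(f.read())
# Loading to memory
# Unzipping
# Zipping back
# Removing from memory, saving on disk
# file content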
I have multiple spiders within one scraping program. I am trying to run all spiders simultaneously from a script and then dump the contents to a JSON file. When I use the shell on each individual spider and do -o xyz.json it works fine.
I've attempted to follow this fairly thorough answer here:
How to create custom Scrapy Item Exporter?
but when I run the file I can see it gather the data in the shell but it does not output it at all.
Below I've copied in order:
Exporter,
Pipeline,
Settings,
Exporter:
from scrapy.exporters import JsonItemExporter

class XYZExport(JsonItemExporter):
    def __init__(self, file, **kwargs):
        super().__init__(file)

    def start_exporting(self):
        self.file.write(b)

    def finish_exporting(self):
        self.file.write(b)
I'm struggling to determine what goes in the self.file.write parentheses?
Pipeline:
from exporters import XYZExport

class XYZExport(object):
    def __init__(self, file_name):
        self.file_name = file_name
        self.file_handle = None

    @classmethod
    def from_crawler(cls, crawler):
        output_file_name = crawler.settings.get('FILE_NAME')
        return cls(output_file_name)

    def open_spider(self, spider):
        print('Custom export opened')
        file = open(self.file_name, 'wb')
        self.file_handle = file
        self.exporter = XYZExport(file)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        print('Custom Exporter closed')
        self.exporter.finish_exporting()
        self.file_handle.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
Settings:
FILE_NAME = 'C:\Apps Ive Built\WebScrape Python\XYZ\ScrapeOutput.json'
ITEM_PIPELINES = {
    'XYZ.pipelines.XYZExport': 600,
}
I hope (and am afraid) it's a simple omission, because that seems to be my MO, but I'm very new to scraping and this is the first time I've tried to do it this way.
If there is a more stable way to export this data I'm all ears; otherwise, can you tell me what I've missed that is preventing the data from being exported, or preventing the exporter from being properly called?
[Edited to change the pipeline name in settings]
I wrote a spider using scrapy, one that makes a whole bunch of HtmlXPathSelector Requests to separate sites. It creates a row of data in a .csv file after each request is (asynchronously) satisfied. It's impossible to see which request is satisfied last, because the request is repeated if no data was extracted yet (occasionally it misses the data a few times). Even though I start with a neat list, the output is jumbled because the rows are written immediately after data is extracted.
Now I'd like to sort that list based on one column, but after every request is done. Can the 'spider_closed' signal be used to trigger a real function? As below, I tried connecting the signal with dispatcher, but this function seems to only print out things, rather than work with variables or even call other functions.
def start_requests(self):
    ... dispatcher.connect(self.spider_closed, signal=signals.engine_stopped) ....

def spider_closed(spider):
    print 'this gets printed alright'  # <- only if the next line is omitted...
    out = self.AnotherFunction(in)     # <- This doesn't seem to run
I hacked together a pipeline to solve this problem for you.
file: Project.middleware_module.SortedCSVPipeline
import csv

from scrapy import signals

class SortedCSVPipeline(object):
    def __init__(self):
        self.items = []
        self.file_name = r'YOUR_FILE_PATH_HERE'
        self.key = 'YOUR_KEY_HERE'

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_closed(self, spider):
        for item in sorted(self.items, key=lambda k: k[self.key]):
            self.write_to_csv(item)

    def process_item(self, item, spider):
        self.items.append(item)
        return item

    def write_to_csv(self, item):
        writer = csv.writer(open(self.file_name, 'a'), lineterminator='\n')
        writer.writerow([item[key] for key in item.keys()])
file: settings.py
ITEM_PIPELINES = {"Project.middleware_module.SortedCSVPipeline.SortedCSVPipeline" : 1000}
When running this you won't need an item exporter anymore, because this pipeline does the csv writing for you. Also, the 1000 in the pipeline entry in your settings needs to be a higher value than that of any other pipeline you want to run before this one. I tested this in my project and it resulted in a csv file sorted by the column I specified! HTH
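For instance, if you also had a pipeline that cleans items before they are collected for sorting, it would get a lower number so it runs first (the CleanItemsPipeline path here is purely hypothetical):

ITEM_PIPELINES = {
    "Project.middleware_module.CleanItemsPipeline": 300,  # runs first (lower number)
    "Project.middleware_module.SortedCSVPipeline.SortedCSVPipeline": 1000,  # runs last
}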
Cheers
Is it possible to access the name of the current spider in a feed exporter?
The doc about storage URI parameters might help.
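For example, feed URIs support the %(name)s parameter, which Scrapy replaces with the spider name when the feed is stored (the FTP host, credentials and path below are placeholders):

# settings.py -- one directory per spider, one file per run
FEED_URI = 'ftp://user:password@ftp.example.com/scraping/feeds/%(name)s/%(time)s.json'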
Or, if you are building your own:
The methods used by exporters are passed the spider object, so you can access its name there.
For example:
def open_spider(self, spider):
    print spider.name

def close_spider(self, spider):
    print spider.name

def item_scraped(self, item, spider):
    print spider.name