Transform final output in scrapy? - python

I have a scrapy process which successfully parses items and sub-items, but I can't see whether there's a final hook which would allow me to transform the final data result after everything has been parsed, but before it is formatted as output.
My spider is doing something like this:
class MySpider(scrapy.Spider):
    def parse(self, response, **kwargs):
        for part in [1, 2, 3]:
            url = f'{response.request.url}?part={part}'
            yield scrapy.Request(url=url, callback=self.parse_part, meta={'part': part})

    def parse_part(self, response, **kwargs):
        # ...
        for subpart in part:
            yield {
                'title': self.get_title(subpart),
                'tag': self.get_tag(subpart)
            }
This works well, but I haven't been able to figure out where I can take the complete resulting structure and transform it before outputting it to JSON (or whatever). I thought maybe I could do this in the process_spider_output hook of a spider middleware, but that only seems to give me the individual items, not the final structure.

You can use this method to do something after the spider has closed:
def spider_closed(self):
However, you won't be able to modify items in that method. To modify items you need to write a custom pipeline. In the pipeline you write a method that gets called every time your spider yields an item, so in that method you can save all items to a list and then transform them in the pipeline's close_spider method.
Read here on how to write your own pipeline
Example:
Let's say you want to have all your items as JSON, to maybe send a request to an API. You have to activate your pipeline in settings.py for it to be used.
import json

class MyPipeline:
    def __init__(self, *args, **kwargs):
        self.items = []

    def process_item(self, item, spider):
        self.items.append(item)
        return item

    def close_spider(self, spider):
        # In this method you can iterate over self.items and transform them as you prefer.
        json_data = json.dumps(self.items)
        print(json_data)
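To activate the pipeline, add it to ITEM_PIPELINES in settings.py. A minimal entry might look like this (the module path myproject.pipelines is an assumption; use your own project's path):
# settings.py (module path assumed)
ITEM_PIPELINES = {
    'myproject.pipelines.MyPipeline': 300,
}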

Related

Scrapy Can duplicates filter be persistent with Jobs?

from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):
    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item
The above code is from the official Scrapy documentation (http://doc.scrapy.org/en/latest/topics/item-pipeline.html) and is used for filtering duplicates.
As the Scrapy documentation suggests (http://doc.scrapy.org/en/latest/topics/jobs.html), to pause and resume a spider I need to use the Jobs system.
So I'm curious whether the Scrapy Jobs system can make the duplicates filter persistent in its directory. The way the duplicates filter is implemented is so simple that I'm in doubt.
You just need to implement your pipeline so that it reads the JOBDIR setting and, when that setting is defined, your pipeline:
Reads the initial value of self.ids_seen from some file inside the JOBDIR directory.
At run time, it updates that file as new IDs are added to the set.
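A minimal sketch of that idea, assuming the IDs are persisted as a JSON file named ids_seen.json inside JOBDIR (the file name, the JSON format, and rewriting the whole file on every item are choices made here for illustration, not something Scrapy prescribes):
import json
import os
from scrapy.exceptions import DropItem

class PersistentDuplicatesPipeline(object):
    def __init__(self, jobdir):
        self.ids_seen = set()
        # Only persist when a JOBDIR is configured.
        self.path = os.path.join(jobdir, 'ids_seen.json') if jobdir else None
        if self.path and os.path.exists(self.path):
            with open(self.path) as f:
                self.ids_seen = set(json.load(f))

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.get('JOBDIR'))

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        self.ids_seen.add(item['id'])
        if self.path:
            # Simple but not efficient for very large crawls: rewrite the file each time.
            with open(self.path, 'w') as f:
                json.dump(list(self.ids_seen), f)
        return item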

Scrapy: Using FTP with XmlItemExporter

I wrote a custom pipeline to get the node names that I wanted:
from scrapy import signals
from scrapy.exporters import XmlItemExporter

class XmlExportPipeline(object):
    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        file = open('crawl.xml', 'w', encoding='utf-8')
        self.files[spider] = file
        self.exporter = XmlItemExporter(file, item_element='job', root_element='jobs', indent=1)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()
        self.uploadftp(spider)

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
Now I can't figure out how to export with FTP instead of just local storage.
To change item data, pipelines are great. And there are indeed export use cases where they also make sense (e.g. splitting items across multiple files).
To change the output format, however, it may be better to implement a custom feed exporter, register it in FEED_EXPORTERS and enable it in FEED_FORMAT.
There’s no extensive documentation about creating custom feed exporters, but if you have a look at the implementation of XmlItemExporter you should be able to figure things out.
In fact, looking at your code and XmlItemExporter’s you may simply need to subclass XmlItemExporter, change its __init__ method to pass item_element='job', root_element='jobs' to the parent __init__, and use the FEED_EXPORT_INDENT setting to define the desired indentation (1).
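A rough sketch of that approach, assuming a project module myproject.exporters; the 'jobxml' format name and the FTP host and credentials in FEED_URI are placeholders as well:
# myproject/exporters.py (module path assumed)
from scrapy.exporters import XmlItemExporter

class JobsXmlItemExporter(XmlItemExporter):
    def __init__(self, file, **kwargs):
        # Default to the desired node names; explicit keyword arguments still win.
        kwargs.setdefault('item_element', 'job')
        kwargs.setdefault('root_element', 'jobs')
        super().__init__(file, **kwargs)

# settings.py
FEED_EXPORTERS = {'jobxml': 'myproject.exporters.JobsXmlItemExporter'}
FEED_FORMAT = 'jobxml'
FEED_URI = 'ftp://user:password@ftp.example.com/path/crawl.xml'  # placeholder host/credentials; Scrapy's feed exports handle ftp:// URIs
FEED_EXPORT_INDENT = 1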

modifying urls before sending for fetching in scrapy

I want to parse a sitemap, find all the URLs in it, append some word to each URL, and then check the response code of all the modified URLs.
For this task I decided to use Scrapy, because it has built-in support for crawling sitemaps, as described in Scrapy's documentation.
With the help of that documentation I created my spider, but I want to change the URLs before they are sent for fetching. For this I tried to take help from this link, which suggested using rules and implementing process_requests(). I am not able to make use of these; the little I tried is commented out below. Could anyone help me write the exact code for the commented lines, or suggest any other way to do this task in Scrapy?
from scrapy.contrib.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/sitemap.xml']
    # sitemap_rules = [some_rules, process_request='process_request')]

    # def process_request(self, request, spider):
    #     modified_url = orginal_url_from_sitemap + 'myword'
    #     return request.replace(url=modified_url)

    def parse(self, response):
        print response.status, response.url
You can connect a function to the request_scheduled signal and do what you want in that function. For example:
from scrapy import signals

class MySpider(SitemapSpider):
    @classmethod
    def from_crawler(cls, crawler):
        spider = cls()
        crawler.signals.connect(spider.request_scheduled, signals.request_scheduled)
        return spider

    def request_scheduled(self, request, spider):
        modified_url = orginal_url_from_sitemap + 'myword'
        request.url = modified_url
SitemapSpider has a sitemap_filter method.
You can override it to implement the required functionality.
class MySpider(SitemapSpider):
    ...
    def sitemap_filter(self, entries):
        for entry in entries:
            entry["loc"] = entry["loc"] + myword
            yield entry
Each of those entry objects is a dict with a structure like this:
{'loc': 'https://example.com/',
 'lastmod': '2019-01-04T08:09:23+00:00',
 'changefreq': 'weekly',
 'priority': '0.8'}
Important note: the SitemapSpider.sitemap_filter method appeared in Scrapy 1.6.0, released in January 2019 (see the 1.6.0 release notes, "New extensibility features" section).
I've just faced this. Apparently you can't really use process_requests, because sitemap rules in SitemapSpider are different from the Rule objects in CrawlSpider - only the latter can take this argument.
After examining the code, it looks like this can be worked around by manually overriding part of the SitemapSpider implementation:
from scrapy import Request
from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['...']
    sitemap_rules = [('/', 'parse')]

    def start_requests(self):
        # override to call custom_parse_sitemap instead of _parse_sitemap
        for url in self.sitemap_urls:
            yield Request(url, self.custom_parse_sitemap)

    def custom_parse_sitemap(self, response):
        # modify requests marked to be called with the parse callback
        for request in super()._parse_sitemap(response):
            if request.callback == self.parse:
                yield self.modify_request(request)
            else:
                yield request

    def modify_request(self, request):
        return request.replace(
            # ...
        )

    def parse(self, response):
        # ...

How to get the pipeline object in Scrapy spider

I have used MongoDB to store the data from the crawl.
Now I want to query the last date of the data, so that I can continue crawling without restarting from the beginning of the URL list. (The URLs can be determined by the date, like: /2014-03-22.html)
I want only one connection object to handle the database operations, and it lives in the pipeline.
So, I want to know how I can get the pipeline object (not a new one) in the spider.
Or, any better solution for incremental updates...
Thanks in advance.
Sorry for my poor English...
Just a sample for now:
# This is my Pipeline
class MongoDBPipeline(object):
    def __init__(self, mongodb_db=None, mongodb_collection=None):
        self.connection = pymongo.Connection(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        ....
    def process_item(self, item, spider):
        ....
    def get_date(self):
        ....
And the spider:
class Spider(Spider):
    name = "test"
    ....
    def parse(self, response):
        # Want to get the Pipeline object
        mongo = MongoDBPipeline()  # if I take this way, I must create a new Pipeline object
        mongo.get_date()  # In scrapy, there must be a Pipeline object for the spider
        # I want to get the Pipeline object which was created when scrapy started.
OK, I just don't want to create a new object... I admit I'm a bit OCD.
A Scrapy Pipeline has an open_spider method that gets executed after the spider is initialized. You can pass a reference to the database connection, the get_date() method, or the Pipeline itself, to your spider. An example of the latter with your code is:
# This is my Pipeline
class MongoDBPipeline(object):
    def __init__(self, mongodb_db=None, mongodb_collection=None):
        self.connection = pymongo.Connection(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        ....
    def process_item(self, item, spider):
        ....
    def get_date(self):
        ....
    def open_spider(self, spider):
        spider.myPipeline = self
Then, in the spider:
class Spider(Spider):
    name = "test"

    def __init__(self):
        self.myPipeline = None

    def parse(self, response):
        self.myPipeline.get_date()
I don't think the __init__() method is necessary here, but I put it here to show that open_spider replaces it after initialization.
According to the scrapy Architecture Overview:
The Item Pipeline is responsible for processing the items once they
have been extracted (or scraped) by the spiders.
Basically that means that spiders run first, and extracted items then go to the pipelines - there is no way to go backwards.
One possible solution would be, in the pipeline itself, to check whether the item you've scraped is already in the database (a rough sketch follows below).
Another workaround would be to keep the list of URLs you've crawled in the database and, in the spider, check whether you've already got the data for a URL.
Since I'm not sure what you mean by "start from the beginning", I cannot suggest anything specific.
Hope at least this information helped.
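A rough sketch of the first workaround, assuming pymongo, a database named mydb, and a collection named items whose documents have an 'id' field (these names and the hard-coded connection details are assumptions for illustration only):
import pymongo
from scrapy.exceptions import DropItem

class MongoDedupPipeline(object):
    def open_spider(self, spider):
        # Connection details are placeholders; read them from settings in a real project.
        self.client = pymongo.MongoClient('localhost', 27017)
        self.collection = self.client['mydb']['items']

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Drop items that are already stored, otherwise insert them.
        if self.collection.find_one({'id': item['id']}):
            raise DropItem("Already in database: %s" % item['id'])
        self.collection.insert_one(dict(item))
        return item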

Python Scrapy function to be called just before spider_closed signal sent?

I wrote a spider using scrapy, one that makes a whole bunch of HtmlXPathSelector Requests to separate sites. It creates a row of data in a .csv file after each request is (asynchronously) satisfied. It's impossible to see which request is satisfied last, because the request is repeated if no data was extracted yet (occasionally it misses the data a few times). Even though I start with a neat list, the output is jumbled because the rows are written immediately after data is extracted.
Now I'd like to sort that list based on one column, but after every request is done. Can the 'spider_closed' signal be used to trigger a real function? As below, I tried connecting the signal with dispatcher, but this function seems to only print out things, rather than work with variables or even call other functions.
def start_requests(self):
    ...
    dispatcher.connect(self.spider_closed, signal=signals.engine_stopped)
    ...

def spider_closed(spider):
    print 'this gets printed alright'  # <- only if the next line is omitted...
    out = self.AnotherFunction(in)  # <- This doesn't seem to run
I hacked together a pipeline to solve this problem for you.
file: Project.middleware_module.SortedCSVPipeline
import csv
from scrapy import signals

class SortedCSVPipeline(object):
    def __init__(self):
        self.items = []
        self.file_name = r'YOUR_FILE_PATH_HERE'
        self.key = 'YOUR_KEY_HERE'

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_closed(self, spider):
        for item in sorted(self.items, key=lambda k: k[self.key]):
            self.write_to_csv(item)

    def process_item(self, item, spider):
        self.items.append(item)
        return item

    def write_to_csv(self, item):
        writer = csv.writer(open(self.file_name, 'a'), lineterminator='\n')
        writer.writerow([item[key] for key in item.keys()])
file: settings.py
ITEM_PIPELINES = {"Project.middleware_module.SortedCSVPipeline.SortedCSVPipeline" : 1000}
When running this you won't need to use an item exporter anymore, because this pipeline does the CSV writing for you. Also, the 1000 in the pipeline entry in your settings needs to be higher than the values of all other pipelines that you want to run before this one. I tested this in my project and it resulted in a CSV file sorted by the column I specified! HTH
Cheers
