How to use Scrapy to download images and then upload them to an S3 server - Python

I want to upload images to S3 when the spider closes.
My method now sends all the images stored in MongoDB: upload_s3(ShA.objects.all())
But I want to change it so that it only sends the images Scrapy downloaded during this run.
I need to pass the variable sh.images from process_item() to close_spider() so that Mongo can filter for the items this crawl produced.
How can I change the code to achieve this?
Here is my pipeline:
from mongo.models import ShA
from uploads3 import upload_s3

class ShPipeline(object):
    def process_item(self, item, spider):
        if isinstance(item, ShaItem):
            sh = item.save(commit=False)
            sh_exist = ShA.objects.filter(link=sh.link)
            if sh_exist:
                sh.id = sh_exist[0].id
            sh.save()
            #sh.images
        return item

    def close_spider(self, spider, item):
        if spider.name == "email":
            upload_s3(ShA.objects.all())
            #upload_s3(ShA.objects.get(images=sh.images)) no use, need to get sh.images from process_item

You can simply use self: store the images on the pipeline instance in process_item() and read them back in close_spider(), as in the sketch below. That said, I would really recommend using Scrapy's built-in images pipeline, which can store files to S3 for you.
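A minimal sketch of the self-based approach, assuming ShA, ShaItem and upload_s3 work as in the question; the images_this_run attribute and the images__in filter are assumptions about your model layer:

from mongo.models import ShA
from uploads3 import upload_s3
# ShaItem import omitted here, as in the original snippet

class ShPipeline(object):
    def open_spider(self, spider):
        # collect only the image references seen during this crawl
        self.images_this_run = []

    def process_item(self, item, spider):
        if isinstance(item, ShaItem):
            sh = item.save(commit=False)
            sh_exist = ShA.objects.filter(link=sh.link)
            if sh_exist:
                sh.id = sh_exist[0].id
            sh.save()
            # remember what was downloaded in this run
            self.images_this_run.append(sh.images)
        return item

    def close_spider(self, spider):
        if spider.name == "email" and self.images_this_run:
            # upload only the objects whose images were seen during this crawl
            # (images__in assumes a Django/MongoEngine-style queryset filter)
            upload_s3(ShA.objects.filter(images__in=self.images_this_run))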

Related

How to make a Scrapy spider stop after a certain number of downloads have been reached?

I need to go to a website that has a list of files and download each one. The problem is that the number of daily downloads is limited (by an authentication system), so my spider needs to stop after a certain number of items have been downloaded, because after that it will not be able to download any more files from the site.
This is what I tried:
settings.py
CLOSESPIDER_ITEMCOUNT = 10
CLOSESPIDER_PAGECOUNT = 50
It does not work because Scrapy is asynchronous by nature and does not take dropped items into account.
https://docs.scrapy.org/en/latest/topics/item-pipeline.html
from itemadapter import ItemAdapter
from scrapy.exceptions import CloseSpider
from scrapy.pipelines.files import FilesPipeline

class DownloadProductVersionPipeline(FilesPipeline):
    count = 0

    def file_path(self, request, response=None, info=None, item=None):
        self.count += 1
        if self.count > 10:
            raise CloseSpider()
        adapter = ItemAdapter(item)
        fileName = f"{adapter['providerId']}/{adapter['product']['id']}/{adapter['product']['id']}-v{adapter['productVersion']['version']}.zip"
        return fileName
Using a pipeline for downloads also does not work because the pipeline is executed for EACH item and does not store values between executions.
It works for me:
settings.py
DOWNLOADER_MIDDLEWARES = {
    'project.middlewares.ProjectDownloaderMiddleware': 543,
}
middlewares.py
from scrapy.exceptions import CloseSpider

class ProjectDownloaderMiddleware(object):
    def process_response(self, request, response, spider):
        downloaded = spider.crawler.stats.get_value('file_status_count/downloaded')
        if downloaded is not None and downloaded >= 10:
            raise CloseSpider(
                'More than 10 items were downloaded from the provider and the spider was stopped to avoid banning')
        return response
I decided to use a downloader middleware instead of a pipeline because I think it is the more semantically appropriate place for this check.
Credits: Jon Clements♦

Scrapy export with headers if empty

As far as I can see nobody has asked this question, and I'm completely stuck trying to solve it. What I have is a spider that sometimes won't return any results (either no results exist or the site could not be scraped due to, say, its robots.txt file), and this results in an empty, headerless CSV file. I have a robot that picks up this file, so when it is empty the robot doesn't realise the job is finished, and without headers it cannot understand the file anyway.
What I want is to output the CSV file with headers every time, even if there are no results. I've tried using JSON instead but have the same issue: if there is an error or there are no results, the file is empty.
I'd be quite happy to call something on the spider closing (for whatever reason, even an error in initialising due to say, a bad url) and writing something to the file.
Thanks.
I solved this by amending my item pipeline and not using the feed exporter from the command line. This let me use close_spider to write the header row when there are no results.
I'd obviously welcome any improvements if I've missed something.
Pipeline code:
from scrapy.exceptions import DropItem
from scrapy.exporters import CsvItemExporter
import csv

from Generic.items import GenericItem

class GenericPipeline:
    def __init__(self):
        self.emails_seen = set()
        self.files = {}

    def open_spider(self, spider):
        self.file = open("results.csv", 'wb')
        self.exporter = CsvItemExporter(self.file)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        if not self.emails_seen:
            header = GenericItem()
            header["email"] = "None Found"
            self.exporter.export_item(header)
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        if item["email"] in self.emails_seen:
            raise DropItem(f"Duplicate item found: {item!r}")
        else:
            self.emails_seen.add(item["email"])
            self.exporter.export_item(item)
            return item
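Since the command-line feed exporter is no longer used, the pipeline also has to be enabled in settings.py. A minimal example, where the module path Generic.pipelines.GenericPipeline is an assumption based on the project name above:

# settings.py
ITEM_PIPELINES = {
    'Generic.pipelines.GenericPipeline': 300,  # assumed module path
}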

Scrapy Splash Screenshot Pipeline not working

I'm trying to save screenshots of scraped webpages with Scrapy Splash. I've copied and pasted the code found here into my pipeline folder: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
Here's the code from the url:
import hashlib
from urllib.parse import quote

import scrapy

class ScreenshotPipeline(object):
    """Pipeline that uses Splash to render a screenshot of
    every Scrapy item."""

    SPLASH_URL = "http://localhost:8050/render.png?url={}"

    async def process_item(self, item, spider):
        encoded_item_url = quote(item["url"])
        screenshot_url = self.SPLASH_URL.format(encoded_item_url)
        request = scrapy.Request(screenshot_url)
        response = await spider.crawler.engine.download(request, spider)

        if response.status != 200:
            # Error happened, return item.
            return item

        # Save screenshot to file, filename will be hash of url.
        url = item["url"]
        url_hash = hashlib.md5(url.encode("utf8")).hexdigest()
        filename = "{}.png".format(url_hash)
        with open(filename, "wb") as f:
            f.write(response.body)

        # Store filename in item.
        item["screenshot_filename"] = filename
        return item
I've also followed the instructions for setting up splash found here: https://github.com/scrapy-plugins/scrapy-splash
When I call the command scrapy crawl spider, everything works correctly except the pipeline.
This is the "Error" I'm seeing.
<coroutine object ScreenshotPipeline.process_item at 0x7f29a9c7c8c0>
The spider is yielding the item correctly, but it will not process the item.
Does anyone have any advice? Thank you.
Edit:
I think what is going on is that Scrapy is calling process_item() the way it would call an ordinary method. However, according to these docs: https://docs.python.org/3/library/asyncio-task.html a coroutine object must be run differently, e.g. asyncio.run(process_item()) rather than process_item().
I think I may have to modify the source code?
You should use scrapy-splash inside the spider itself, not in a pipeline.
I followed the scrapy-splash docs and it works for me.
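A minimal sketch of that approach, assuming Splash is running on localhost:8050 and scrapy-splash is configured in settings.py as described in its README; the spider name and start URL are placeholders:

import hashlib

import scrapy
from scrapy_splash import SplashRequest

class ScreenshotSpider(scrapy.Spider):
    name = "screenshots"
    start_urls = ["https://example.com"]  # placeholder

    def start_requests(self):
        for url in self.start_urls:
            # render.png returns the rendered page as a PNG image
            yield SplashRequest(url, self.parse, endpoint="render.png")

    def parse(self, response):
        # response.body is the PNG returned by Splash
        url_hash = hashlib.md5(response.url.encode("utf8")).hexdigest()
        filename = "{}.png".format(url_hash)
        with open(filename, "wb") as f:
            f.write(response.body)
        yield {"url": response.url, "screenshot_filename": filename}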

Multi-step / nested scrapy file download

I'm trying to download a file using a custom Scrapy pipeline. However, the file URL is not trivial to obtain. Here are the steps:
the pipeline gets an item containing a pdfLink attribute
the page at pdfLink is a wrapper around the PDF, which is embedded in an iframe
I then extend the FilesPipeline class:
import logging

import scrapy
from scrapy.pipelines.files import FilesPipeline

class PdfPipeline(FilesPipeline):
    def get_media_requests(self, item, spider):
        yield scrapy.Request(item['pdfLink'],
                             callback=self.get_pdfurl)

    def get_pdfurl(self, response):
        logging.info('...............')
        print(response.url)
        yield scrapy.Request(response.css('iframe::attr(src)').extract()[0])
However:
the files that are downloaded are the web pages pointed to by pdfLink, not the embedded PDF file.
neither the print nor the logging.info output shows up in the logs.
It therefore seems that get_pdfurl is never called back. Am I doing something wrong? How can such a nested file be downloaded?
I found a solution using two consecutive pipelines, where the first is built like the "Take screenshot of item" example from the item pipeline docs.
import scrapy
from scrapy.pipelines.files import FilesPipeline

class PdfWrapperPipeline(object):
    def process_item(self, item, spider):
        # download the wrapper page that embeds the PDF in an iframe
        request = scrapy.Request(item.get('pdfLink'))
        dfd = spider.crawler.engine.download(request, spider)
        dfd.addBoth(self.return_item, item)
        return dfd

    def return_item(self, response, item):
        if response.status != 200:
            # Error happened, return item.
            return item
        url = response.css('iframe::attr(src)').extract()[0]
        item['pdfUrl'] = url
        return item


class PdfPipeline(FilesPipeline):
    def get_media_requests(self, item, spider):
        yield scrapy.Request(item.get('pdfUrl'))
Then, in settings.py, give the wrapper pipeline a higher priority than the PDF pipeline (a lower order value runs first):
ITEM_PIPELINES = {
    'project.pipelines.PdfWrapperPipeline': 1,
    'project.pipelines.PdfPipeline': 2,
}
This answer was first posted on Scrapy's GitHub.
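One further note: FilesPipeline subclasses such as PdfPipeline only download anything once a storage location is configured, so settings.py also needs something like the following (the path is a placeholder):

# settings.py
FILES_STORE = '/path/to/downloaded/pdfs'  # placeholder; an s3:// URI also works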

Scrapy : Program organization when interacting with secondary website

I'm working with Scrapy 1.1 and I have a project where spider '1' scrapes site A (where I acquire 90% of the information needed to fill my items). However, depending on the results of the site A scrape, I may need to scrape additional information from site B. In terms of program organization, does it make more sense to scrape site B within spider '1', or would it be possible to interact with site B from within a pipeline object? I prefer the latter, thinking that it decouples the scraping of the two sites, but I'm not sure whether this is possible or the best way to handle this use case. Another approach might be to use a second spider (spider '2') for site B, but then I assume I would have to let spider '1' run, save to the db, and then run spider '2'. Anyway, any advice would be appreciated.
Both approaches are very common, and this is just a question of preference. For your case, containing everything in one spider sounds like a straightforward solution.
You can add a url field to your item and then schedule and parse it later in the pipeline:
from scrapy import Request
from scrapy.exceptions import DropItem

class MyPipeline(object):
    def __init__(self, crawler):
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_item(self, item, spider):
        extra_url = item.get('extra_url', None)
        if not extra_url:
            return item
        req = Request(url=extra_url,
                      callback=self.custom_callback,
                      meta={'item': item})
        self.crawler.engine.crawl(req, spider)
        # you have to drop the item here since you will return it later anyway
        raise DropItem()

    def custom_callback(self, response):
        # retrieve your item
        item = response.meta['item']
        # do something to add to item
        item['some_extra_stuff'] = ...
        del item['extra_url']
        yield item
What the code above does is check whether the item has an extra_url field; if it does, it drops the item and schedules a new request. That request fills the item with the extra data and yields it again, sending it back through the pipeline.
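For comparison, the in-spider approach mentioned above might look roughly like this (a minimal sketch; the URLs, selectors, field names and the parse_site_b callback are assumptions):

import scrapy

class SiteASpider(scrapy.Spider):
    name = "site_a"
    start_urls = ["https://site-a.example/listing"]  # placeholder URL

    def parse(self, response):
        item = {"title": response.css("h1::text").get()}  # assumed field
        extra_url = response.css("a.more::attr(href)").get()  # assumed selector
        if extra_url:
            # follow to site B and finish the item there
            yield response.follow(extra_url, self.parse_site_b,
                                  meta={"item": item})
        else:
            yield item

    def parse_site_b(self, response):
        item = response.meta["item"]
        item["some_extra_stuff"] = response.css("p.detail::text").get()
        yield item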
