Scrapy export with headers if empty - python

As far as I can see nobody has asked this question, and I'm completely stuck trying to solve it. I have a spider that sometimes returns no results (either because none exist or because the site could not be scraped, for example due to its robots.txt file), and this produces an empty, headerless CSV file. A robot picks up this file downstream; when the file is empty it doesn't realise the crawl is finished, and without headers it can't interpret the file anyway.
What I want is to output the CSV file with headers every time, even if there are no results. I've tried using JSON instead but have the same issue: if there is an error or there are no results, the file is empty.
I'd be quite happy to call something when the spider closes (for whatever reason, even an error during initialisation due to, say, a bad URL) and write something to the file.
Thanks.

I solved this by amending my item pipeline instead of using the feed exporter on the command line. This let me use close_spider to write a header when there are no results.
I'd obviously welcome any improvements if I've missed something.
Pipeline code:
from scrapy.exceptions import DropItem
from scrapy.exporters import CsvItemExporter

from Generic.items import GenericItem


class GenericPipeline:

    def __init__(self):
        self.emails_seen = set()

    def open_spider(self, spider):
        self.file = open("results.csv", "wb")
        self.exporter = CsvItemExporter(self.file)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        if not self.emails_seen:
            # No results: export a placeholder item so the CSV still gets headers.
            header = GenericItem()
            header["email"] = "None Found"
            self.exporter.export_item(header)
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        if item["email"] in self.emails_seen:
            raise DropItem(f"Duplicate item found: {item!r}")
        else:
            self.emails_seen.add(item["email"])
            self.exporter.export_item(item)
            return item
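For completeness, the pipeline also has to be enabled. A minimal sketch of the settings.py entry, assuming the project module is called Generic and the class lives in Generic/pipelines.py (adjust the dotted path to your layout):

# settings.py -- the dotted path below is an assumption about the project layout
ITEM_PIPELINES = {
    "Generic.pipelines.GenericPipeline": 300,
}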

Related

Item is losing name while yielding in Python / Scrapy

In Scrapy 2.4.x on Python 3.8.x I am yielding an item in order to save some stats to a DB. The scraper yields another Item as well.
While the class name of the item, "StatsItem", is visible in the main script, it is lost within the other class. I am using the name of the item to decide which method to call:
in scraper.py:
import scrapy
from crawler.items import StatsItem, OtherItem


class demo(scrapy.Spider):

    def parse_item(self, response):
        stats = StatsItem()
        stats['results'] = 10
        yield stats
        print(type(stats).__name__)
        # Output: StatsItem
        print(stats)
        # Output: {'results': 10}
in pipeline.py:
import scrapy
from crawler.items import StatsItem, OtherItem


class mysql_pipeline(object):

    def process_item(self, item, spider):
        print(type(item).__name__)
        # Output: NoneType
        if isinstance(item, StatsItem):
            self.save_stats(item, spider)
        elif isinstance(item, OtherItem):
            pass  # call other method
        return item
The output of print is "StatsItem" in the first class but "NoneType" within the pipeline, so the method save_stats() never gets called.
I am pretty new to Python, so there might be a better way of doing this. There is no error message or exception that I am aware of. Any help is greatly appreciated.
You can't use yield outside of a function imo.
I was finally able to locate the problem. This particular crawler was nearly identical to all the others that did not have this issue, with one exception: I was setting the item pipeline via custom settings:
custom_settings.update({
    'ITEM_PIPELINES': {
        'crawler.pipelines.mysql_pipeline': 301,
    }
})
Removing this fixed the issue.
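For reference, if the pipeline should apply to every spider in the project, the usual place to enable it is settings.py rather than per-spider custom_settings; a minimal sketch reusing the module path from the snippet above:

# settings.py
ITEM_PIPELINES = {
    'crawler.pipelines.mysql_pipeline': 301,
}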

Scrapy Splash Screenshot Pipeline not working

I'm trying to save screenshots of scraped webpages with Scrapy Splash. I've copied and pasted the code found here into my pipeline folder: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
Here's the code from the url:
import scrapy
import hashlib
from urllib.parse import quote


class ScreenshotPipeline(object):
    """Pipeline that uses Splash to render screenshot of
    every Scrapy item."""

    SPLASH_URL = "http://localhost:8050/render.png?url={}"

    async def process_item(self, item, spider):
        encoded_item_url = quote(item["url"])
        screenshot_url = self.SPLASH_URL.format(encoded_item_url)
        request = scrapy.Request(screenshot_url)
        response = await spider.crawler.engine.download(request, spider)

        if response.status != 200:
            # Error happened, return item.
            return item

        # Save screenshot to file, filename will be hash of url.
        url = item["url"]
        url_hash = hashlib.md5(url.encode("utf8")).hexdigest()
        filename = "{}.png".format(url_hash)
        with open(filename, "wb") as f:
            f.write(response.body)

        # Store filename in item.
        item["screenshot_filename"] = filename
        return item
I've also followed the instructions for setting up Splash found here: https://github.com/scrapy-plugins/scrapy-splash
When I run the command scrapy crawl spider, everything works correctly except the pipeline.
This is the "Error" I'm seeing.
<coroutine object ScreenshotPipeline.process_item at 0x7f29a9c7c8c0>
The spider is yielding the item correctly, but it will not process the item.
Does anyone have any advice? Thank you.
Edit:
I think what is going on is that Scrapy is calling the process_item() method the normal, synchronous way. However, according to these docs: https://docs.python.org/3/library/asyncio-task.html a coroutine object has to be invoked differently:
asyncio.run(process_item()) rather than process_item().
I think I may have to modify the source code?
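Modifying the source shouldn't be necessary. If the installed Scrapy version doesn't await coroutine pipelines, one workaround is the Deferred-based pattern that older revisions of the same docs example used: return the Deferred from engine.download() and finish the work in a callback, which item pipelines are allowed to do. A rough sketch of that variant, assuming the item still carries a "url" field:

import hashlib
from urllib.parse import quote

import scrapy


class ScreenshotPipeline(object):
    """Deferred-based variant: no coroutine, so older Scrapy versions handle it."""

    SPLASH_URL = "http://localhost:8050/render.png?url={}"

    def process_item(self, item, spider):
        encoded_item_url = quote(item["url"])
        screenshot_url = self.SPLASH_URL.format(encoded_item_url)
        request = scrapy.Request(screenshot_url)
        # engine.download() returns a Deferred here; Scrapy waits for it
        # because we return it from process_item.
        dfd = spider.crawler.engine.download(request, spider)
        dfd.addBoth(self.return_item, item)
        return dfd

    def return_item(self, response, item):
        if response.status != 200:
            # Error happened, return item untouched.
            return item
        url_hash = hashlib.md5(item["url"].encode("utf8")).hexdigest()
        filename = "{}.png".format(url_hash)
        with open(filename, "wb") as f:
            f.write(response.body)
        item["screenshot_filename"] = filename
        return item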
You should use scrapy-splash inside the spider itself, not in the pipelines.
I followed these docs and it works for me.
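A minimal sketch of that spider-side approach, assuming a Splash instance on localhost:8050 and the scrapy-splash middlewares from the linked setup instructions are enabled; the spider name and start URL are just placeholders:

import hashlib

import scrapy
from scrapy_splash import SplashRequest


class ScreenshotSpider(scrapy.Spider):
    name = "screenshots"
    start_urls = ["http://example.com"]

    def start_requests(self):
        for url in self.start_urls:
            # The render.png endpoint returns the rendered page as a PNG body.
            yield SplashRequest(url, self.save_screenshot,
                                endpoint="render.png",
                                args={"wait": 1, "width": 1024})

    def save_screenshot(self, response):
        # Filename is a hash of the URL, as in the pipeline example above.
        url_hash = hashlib.md5(response.url.encode("utf8")).hexdigest()
        filename = "{}.png".format(url_hash)
        with open(filename, "wb") as f:
            f.write(response.body)
        yield {"url": response.url, "screenshot_filename": filename}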

route results from yield to a file

I have the following Python script using Scrapy:
import scrapy


class ChemSpider(scrapy.Spider):
    name = "site"

    def start_requests(self):
        urls = [
            'https://www.site.com.au'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        category_links = response.css('li').xpath('a/@href').getall()
        category_links_filtered = [x for x in category_links if 'shop-online' in x]  # remove non-category links
        category_links_filtered = list(dict.fromkeys(category_links_filtered))  # remove duplicates

        for category_link in category_links_filtered:
            if "medicines" in category_link:
                next_page = response.urljoin(category_link) + '?size=10'
                self.log(next_page)
                yield scrapy.Request(next_page, callback=self.parse_subcategories)

    def parse_subcategories(self, response):
        for product in response.css('div.Product'):
            yield {
                'category_link': response.url,
                'product_name': product.css('img::attr(alt)').get(),
                'product_price': product.css('span.Price::text').get().replace('\n', '')
            }
My solution will run multiple instances of this script, each scraping a different subset of information from different 'categories'. I know you can run Scrapy from the command line and have it output to a JSON file, but I want to write the output to a file from within the function, so that each instance writes to a different file. Being a beginner with Python, I'm not sure where to go with my script. I need to get the output of the yield into a file while the script is executing. How do I achieve this? There will be hundreds of rows scraped, and I'm not familiar enough with how yield works to understand how to 'return' from it a set of data (or a list) that can then be written to a file.
You are looking to append to a file. But since writing a file is an I/O operation, you need to lock the file against writes by other processes while one process is writing.
The easiest way to achieve this is to write to different randomly named files in a directory and concatenate them all afterwards using another process.
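A rough sketch of that idea, assuming JSON-lines output; the out/ directory and part-file naming are illustrative choices:

import glob
import json
import os
import uuid


def write_part(rows, out_dir="out"):
    # Each crawler process writes its own uniquely named part file,
    # so no locking is needed.
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, "part-{}.jl".format(uuid.uuid4().hex))
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    return path


def merge_parts(out_dir="out", merged="merged.jl"):
    # A separate process concatenates all part files once the crawls are done.
    with open(merged, "w") as out:
        for part in sorted(glob.glob(os.path.join(out_dir, "part-*.jl"))):
            with open(part) as f:
                out.write(f.read())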
First, let me suggest some changes to your code. If you want to remove duplicates you could use a set like this:
category_links_filtered = (x for x in category_links if 'shop-online' in x) # remove non category links
category_links_filtered = set(category_links_filtered) # remove duplicates
Note that I'm also changing the [ to ( to make a generator instead of a list and save some memory. Read more about generators here: https://www.python-course.eu/python3_generators.php
OK, the solution to your problem is to use an Item Pipeline (https://docs.scrapy.org/en/latest/topics/item-pipeline.html), which performs some action on every item yielded from your parse_subcategories function. What you do is add a class to your pipelines.py file and enable it in settings.py. That is:
In settings.py:
ITEM_PIPELINES = {
    'YOURBOTNAME.pipelines.CategoriesPipeline': 300,  # the number here is the priority of the pipeline; don't worry and just leave it
}
In pipelines.py:
import json
from urllib.parse import urlparse  # library to parse urls (urlparse.urlparse on Python 2)


class CategoriesPipeline(object):
    # This class dynamically names the output file from a spider attribute
    # or from the category name in the start url.

    def open_spider(self, spider):
        if hasattr(spider, 'filename'):
            # the filename is an attribute set by -a filename=somefilename
            filename = spider.filename
        else:
            # you could also set the name dynamically from the start url,
            # if you set -a start_url=https://www.site.com.au/category-name
            try:
                filename = urlparse(spider.start_url).path[1:]  # this returns 'category-name'; replace spaces with _ if needed
            except AttributeError:
                spider.crawler.engine.close_spider(self, reason='no start url')  # this should not happen
        self.file = open(filename + '.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
In spiders/YOURBOTNAME.py modify this:
class ChemSpider(scrapy.Spider):
    name = "site"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        if not hasattr(self, 'start_url'):
            raise ValueError('no start url')  # we need a start url
        # see why this works: https://docs.scrapy.org/en/latest/intro/tutorial.html#a-shortcut-for-creating-requests
        self.start_urls = [self.start_url]

    def parse(self, response):
        # ...
and then you start your crawl with this command: scrapy crawl site -a start_url=https://www.site.com.au/category-name and you could optionally add -a filename=somename

Multi-step / nested scrapy file download

I'm trying to download a file using a custom Scrapy pipeline. However, the file URL is not trivial to obtain. Here are the steps:
the pipeline gets an item containing a pdfLink attribute
the page at pdfLink is a wrapper around the PDF, which is embedded in an iframe
I then extend the FilesPipeline class:
import scrapy
from scrapy.pipelines.files import FilesPipeline


class PdfPipeline(FilesPipeline):

    def get_media_requests(self, item, spider):
        yield scrapy.Request(item['pdfLink'],
                             callback=self.get_pdfurl)

    def get_pdfurl(self, response):
        import logging
        logging.info('...............')
        print(response.url)
        yield scrapy.Request(response.css('iframe::attr(src)').extract()[0])
However:
the files that are downloaded are the web pages pointed to by pdfLink, not the embedded PDF files.
neither the print nor the logging.info output shows up in the logs.
It therefore seems that get_pdfurl is never called back. Am I doing something wrong? How is it possible to download such a nested file?
I found a solution by using two consecutive pipelines, where the first is built like the one in Item pipeline - Take screenshot of item.
import scrapy
from scrapy.pipelines.files import FilesPipeline


class PdfWrapperPipeline(object):

    def process_item(self, item, spider):
        # Download the wrapper page and extract the real pdf url from it.
        request = scrapy.Request(item.get('pdfLink'))
        dfd = spider.crawler.engine.download(request, spider)
        dfd.addBoth(self.return_item, item)
        return dfd

    def return_item(self, response, item):
        if response.status != 200:
            # Error happened, return item.
            return item
        url = response.css('iframe::attr(src)').extract()[0]
        item['pdfUrl'] = url
        return item


class PdfPipeline(FilesPipeline):

    def get_media_requests(self, item, spider):
        yield scrapy.Request(item.get('pdfUrl'))
and then, in settings.py, give the wrapper pipeline a higher priority than the PDF pipeline (a lower number runs earlier):
ITEM_PIPELINES = {
    'project.pipelines.PdfWrapperPipeline': 1,
    'project.pipelines.PdfPipeline': 2,
}
This answer was first posted on Scrapy's GitHub.
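As a hedged alternative sketch (not what the answer above does): the iframe can also be resolved in the spider, handing the final URL to the stock FilesPipeline through the standard file_urls field. The spider name, selectors and URLs below are illustrative:

import scrapy


class PdfSpider(scrapy.Spider):
    name = "pdf_wrapper"
    start_urls = ["https://example.com/listing"]

    # settings.py would enable the built-in pipeline and a storage directory:
    # ITEM_PIPELINES = {"scrapy.pipelines.files.FilesPipeline": 1}
    # FILES_STORE = "downloads"

    def parse(self, response):
        for wrapper_link in response.css("a.pdf-wrapper::attr(href)").getall():
            yield response.follow(wrapper_link, callback=self.parse_wrapper)

    def parse_wrapper(self, response):
        # The wrapper page embeds the real PDF in an iframe.
        pdf_url = response.css("iframe::attr(src)").get()
        if pdf_url:
            yield {
                "pdfLink": response.url,
                "file_urls": [response.urljoin(pdf_url)],  # consumed by FilesPipeline
            }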

How can I use the fields_to_export attribute in BaseItemExporter to order my Scrapy CSV data?

I have made a simple Scrapy spider that I use from the command line to export my data into the CSV format, but the order of the data seems random. How can I order the CSV fields in my output?
I use the following command line to get CSV data:
scrapy crawl somwehere -o items.csv -t csv
According to this Scrapy documentation, I should be able to use the fields_to_export attribute of the BaseItemExporter class to control the order. But I am clueless how to use this as I have not found any simple example to follow.
Please Note: This question is very similar to THIS one. However, that question is over 2 years old, doesn't address the many recent changes to Scrapy, and doesn't provide a satisfactory answer, as it requires hacking one or both of:
contrib/exporter/__init__.py
contrib/feedexport.py
to address some previous issues that seem to have already been resolved...
Many thanks in advance.
To use such an exporter you need to create your own item pipeline that will process your spider output. Assuming you have a simple case and want all spider output in one file, this is the pipeline you should use (pipelines.py):
from scrapy import signals
from scrapy.contrib.exporter import CsvItemExporter  # in newer Scrapy: from scrapy.exporters import CsvItemExporter


class CSVPipeline(object):

    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        file = open('%s_items.csv' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = CsvItemExporter(file)
        self.exporter.fields_to_export = [...]  # list with names of fields to export; order is important
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
Of course you need to remember to add this pipeline to your configuration file (settings.py):
ITEM_PIPELINES = {'myproject.pipelines.CSVPipeline': 300 }
You can now specify settings in the spider itself.
https://doc.scrapy.org/en/latest/topics/settings.html#settings-per-spider
To set the field order for exported feeds, set FEED_EXPORT_FIELDS.
https://doc.scrapy.org/en/latest/topics/feed-exports.html#feed-export-fields
The spider below dumps all links on a website (written against Scrapy 1.4.0):
import scrapy
from scrapy.http import HtmlResponse


class DumplinksSpider(scrapy.Spider):
    name = 'dumplinks'
    allowed_domains = ['www.example.com']
    start_urls = ['http://www.example.com/']
    custom_settings = {
        # specifies exported fields and order
        'FEED_EXPORT_FIELDS': ["page", "page_ix", "text", "url"],
    }

    def parse(self, response):
        if not isinstance(response, HtmlResponse):
            return
        a_selectors = response.xpath('//a')
        for i, a_selector in enumerate(a_selectors):
            text = a_selector.xpath('normalize-space(text())').extract_first()
            url = a_selector.xpath('@href').extract_first()
            yield {
                'page_ix': i + 1,
                'page': response.url,
                'text': text,
                'url': url,
            }
            yield response.follow(url, callback=self.parse)  # see allowed_domains
Run with this command:
scrapy crawl dumplinks --loglevel=INFO -o links.csv
Fields in links.csv are ordered as specified by FEED_EXPORT_FIELDS.
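For reference, the same ordering can be applied project-wide instead of per spider; a minimal settings.py sketch using the field names from the example above:

# settings.py
FEED_EXPORT_FIELDS = ["page", "page_ix", "text", "url"]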
I found a pretty simple way to solve this issue. The answers above are still more correct, I would say, but this is a quick fix. It turns out Scrapy pulls the fields in alphabetical order, and capitalisation matters: a field beginning with 'A' will be pulled first, then 'B', 'C', etc., followed by 'a', 'b', 'c'. I have a project going right now where the header names are not extremely important, but I did need the UPC to be the first header for input into another program. I have the following item class:
class ItemInfo(scrapy.Item):
    item = scrapy.Field()
    price = scrapy.Field()
    A_UPC = scrapy.Field()
    ID = scrapy.Field()
    time = scrapy.Field()
My CSV file outputs with the headers (in order): A_UPC, ID, item, price, time
