I have multiple spiders within one scraping project, and I am trying to run all the spiders simultaneously from a script and then dump the contents to a JSON file. When I run each individual spider from the shell with -o xyz.json it works fine.
I've attempted to follow this fairly thorough answer here:
How to create custom Scrapy Item Exporter?
but when I run the file I can see it gathering the data in the shell, yet nothing is written to the output file at all.
Below I've copied, in order: the exporter, the pipeline, and the settings.
Exporter:
from scrapy.exporters import JsonItemExporter

class XYZExport(JsonItemExporter):

    def __init__(self, file, **kwargs):
        super().__init__(file)

    def start_exporting(self):
        self.file.write(b)

    def finish_exporting(self):
        self.file.write(b)
I'm struggling to determine what goes inside the self.file.write parentheses.
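(For reference, the stock JsonItemExporter essentially writes the opening and closing brackets of the JSON array in these two hooks; a sketch, since the exact bytes may differ between Scrapy versions:)

def start_exporting(self):
    self.file.write(b'[')   # open the JSON array

def finish_exporting(self):
    self.file.write(b']')   # close the JSON array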
Pipeline:
from exporters import XYZExport

class XYZExport(object):

    def __init__(self, file_name):
        self.file_name = file_name
        self.file_handle = None

    @classmethod
    def from_crawler(cls, crawler):
        output_file_name = crawler.settings.get('FILE_NAME')
        return cls(output_file_name)

    def open_spider(self, spider):
        print('Custom export opened')
        file = open(self.file_name, 'wb')
        self.file_handle = file
        self.exporter = XYZExport(file)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        print('Custom Exporter closed')
        self.exporter.finish_exporting()
        self.file_handle.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
Settings:
FILE_NAME = 'C:\Apps Ive Built\WebScrape Python\XYZ\ScrapeOutput.json'
ITEM_PIPELINES = {
    'XYZ.pipelines.XYZExport': 600,
}
I hope (and fear) it's a simple omission, because that seems to be my MO, but I'm very new to scraping and this is the first time I've tried to do it this way.
If there is a more stable way to export this data, I'm all ears; otherwise, can you tell me what I've missed that is preventing the data from being exported, or preventing the exporter from being properly called?
[Edited to change the pipeline name in settings]
I wrote a custom pipeline to get the node names that I wanted:
from scrapy import signals
from scrapy.exporters import XmlItemExporter

class XmlExportPipeline(object):

    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        file = open('crawl.xml', 'w', encoding='utf-8')
        self.files[spider] = file
        self.exporter = XmlItemExporter(file, item_element='job', root_element='jobs', indent=1)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()
        self.uploadftp(spider)

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
Now I can't figure out how to export with FTP instead of just local storage.
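(For reference, a minimal sketch of such an uploadftp helper using the standard-library ftplib; the host, credentials, and remote filename below are placeholders, not anything from the original code:)

from ftplib import FTP

def uploadftp(self, spider):
    # Upload the file this pipeline just wrote; host/credentials are placeholders.
    with FTP('ftp.example.com') as ftp:
        ftp.login('user', 'password')
        with open('crawl.xml', 'rb') as f:
            ftp.storbinary('STOR crawl.xml', f)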
To change item data, pipelines are great. And there are indeed export use cases where they also make sense (e.g. splitting items across multiple files).
To change the output format, however, it may be better to implement a custom feed exporter, register it in FEED_EXPORTERS and enable it in FEED_FORMAT.
There’s no extensive documentation about creating custom feed exporters, but if you have a look at the implementation of XmlItemExporter you should be able to figure things out.
In fact, looking at your code and XmlItemExporter’s you may simply need to subclass XmlItemExporter, change its __init__ method to pass item_element='job', root_element='jobs' to the parent __init__, and use the FEED_EXPORT_INDENT setting to define the desired indentation (1).
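A minimal sketch of that suggestion (the module path, exporter name, and class name are made up for illustration):

# exporters.py -- subclass XmlItemExporter and fix the element names
from scrapy.exporters import XmlItemExporter

class JobsXmlItemExporter(XmlItemExporter):
    def __init__(self, file, **kwargs):
        kwargs.setdefault('item_element', 'job')
        kwargs.setdefault('root_element', 'jobs')
        super().__init__(file, **kwargs)

# settings.py -- register the exporter and make it the feed format
FEED_EXPORTERS = {'jobs_xml': 'myproject.exporters.JobsXmlItemExporter'}
FEED_FORMAT = 'jobs_xml'
FEED_EXPORT_INDENT = 1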
Unfortunately I don't have enough reputation to make a comment, so I have to ask this as a new question, referring to https://stackoverflow.com/questions/23105590/how-to-get-the-pipeline-object-in-scrapy-spider
I have many URLs in a DB, so I want to get the start URLs from my db. So far, not a big problem.
However, I don't want the MySQL code inside the spider, and in the pipeline I run into a problem.
If I try to hand the pipeline object over to my spider, as in the referenced question, I only get an AttributeError with the message
'NoneType' object has no attribute 'getUrl'
I think the actual problem is that the spider_opened function never gets called (I also inserted a print statement, which never showed its output in the console).
Does somebody have an idea how to get the pipeline object inside the spider?
MySpider.py
def __init__(self):
    self.pipe = None

def start_requests(self):
    url = self.pipe.getUrl()
    scrapy.Request(url, callback=self.parse)
Pipeline.py
@classmethod
def from_crawler(cls, crawler):
    pipeline = cls()
    crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)

def spider_opened(self, spider):
    spider.pipe = self

def getUrl(self):
    ...
Scrapy pipelines already have expected methods of open_spider and close_spider
Taken from docs: https://doc.scrapy.org/en/latest/topics/item-pipeline.html#open_spider
open_spider(self, spider)
This method is called when the spider is opened.
Parameters: spider (Spider object) – the spider which was opened
close_spider(self, spider)
This method is called when the spider is closed.
Parameters: spider (Spider object) – the spider which was closed
However, your original approach doesn't make much sense: why do you want to assign a pipeline reference to your spider? That seems like a very bad idea.
What you should do is open the db and read the urls in the spider itself:
from scrapy import Spider

class MySpider(Spider):
    name = 'myspider'
    start_urls = []

    @classmethod
    def from_crawler(self, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        spider.start_urls = self.get_urls_from_db()
        return spider

    def get_urls_from_db(self):
        db = # get db cursor here
        urls = # use cursor to pop your urls
        return urls
I'm using the accepted solution, but it doesn't work as expected:
TypeError: get_urls_from_db() missing 1 required positional argument: 'self'
Here's the version that works on my side:
import os

from scrapy import Spider

class MySpider(Spider):
    name = 'myspider'
    start_urls = []

    def __init__(self, db_dsn):
        self.db_dsn = db_dsn
        self.start_urls = self.get_urls_from_db(db_dsn)

    @classmethod
    def from_crawler(cls, crawler):
        spider = cls(
            db_dsn=os.getenv('DB_DSN', 'mongodb://localhost:27017'),
        )
        spider._set_crawler(crawler)
        return spider

    def get_urls_from_db(self, db_dsn):
        db = # get db cursor here
        urls = # use cursor to pop your urls
        return urls
I have 2 spiders and run them like this:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
settings = get_project_settings()
process1 = CrawlerProcess(settings)
process1.crawl('spider1')
process1.crawl('spider2')
process1.start()
and I want these spiders to write to a common file.
This is Pipeline class:
import codecs
import json
from collections import OrderedDict

class FilePipeline(object):

    def __init__(self):
        self.file = codecs.open('data.txt', 'w', encoding='utf-8')
        self.spiders = []

    def open_spider(self, spider):
        self.spiders.append(spider.name)

    def process_item(self, item, spider):
        line = json.dumps(OrderedDict(item), ensure_ascii=False, sort_keys=False) + "\n"
        self.file.write(line)
        return item

    def spider_closed(self, spider):
        self.spiders.remove(spider.name)
        if len(self.spiders) == 0:
            self.file.close()
Although I don't get an error message, when all spiders are done the common file contains fewer lines (items) than the scrapy log reports, and a few lines are cut off. Is there a recommended practice for writing to one file simultaneously from two spiders?
UPDATE:
Thanks, everybody!
I implemented it this way:
import codecs
import json
import threading
from collections import OrderedDict

class FilePipeline1(object):

    lock = threading.Lock()
    datafile = codecs.open('myfile.txt', 'w', encoding='utf-8')

    def __init__(self):
        pass

    def open_spider(self, spider):
        pass

    def process_item(self, item, spider):
        line = json.dumps(OrderedDict(item), ensure_ascii=False, sort_keys=False) + "\n"
        try:
            FilePipeline1.lock.acquire()
            if isinstance(item, VehicleItem):
                FilePipeline1.datafile.write(line)
        except:
            pass
        finally:
            FilePipeline1.lock.release()
        return item

    def spider_closed(self, spider):
        pass
I agree with A. Abramov's answer.
Here is just an idea I had. You could create two tables in a DB of your choice and then merge them after both spiders are done crawling. You would have to keep track of the time the logs came in so you can order your logs based on time received. You could then dump the db into whatever file type you would like. This way, the program doesn't have to wait for one process to complete before writing to the file and you don't have to do any multithreaded programming.
UPDATE:
Actually, depending on how long your spiders run, you could just store the log output and the time in a dictionary, where the times are the keys and the log output lines are the values. This would be easier than initializing a db. You could then dump the dict into your file ordered by key, as in the sketch below.
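A rough sketch of that idea (the function and variable names are made up; adapt to however you collect the rows):

import time

rows_by_time = {}

def record(row):
    # use the arrival time as the key, the scraped row as the value
    rows_by_time[time.time()] = row

def dump(path):
    # write the rows out ordered by when they arrived
    with open(path, 'w', encoding='utf-8') as f:
        for ts in sorted(rows_by_time):
            f.write(rows_by_time[ts] + '\n')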
Both of your spiders, in separate threads, write to the file simultaneously. That will lead to problems such as lines being cut off and some of them going missing if you don't take care of synchronization, as the post above says. To handle it, you need to either synchronize file access and write only whole records/lines, or have a strategy for allocating regions of the file to different threads (e.g. re-building a file with known offsets and sizes), and by default you have neither of these. In general, writing to the same file at the same time from two different threads is not a common method, and unless you really know what you're doing, I don't advise you to do so.
Instead, I'd separate the spiders' IO and wait for one's action to finish before starting the other; considering your threads aren't synchronized, that will both make the program more efficient and make it work :) If you want a code example of how to do this in your context, just ask and I'll happily provide it.
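(A minimal sketch of that sequential approach, based on the CrawlerRunner pattern from the Scrapy docs; the spider names are placeholders:)

from twisted.internet import defer, reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

configure_logging()
runner = CrawlerRunner(get_project_settings())

@defer.inlineCallbacks
def crawl():
    yield runner.crawl('spider1')   # runs to completion first
    yield runner.crawl('spider2')   # then the second spider starts
    reactor.stop()

crawl()
reactor.run()   # blocks until both crawls have finished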
I wrote a spider using scrapy, one that makes a whole bunch of HtmlXPathSelector Requests to separate sites. It creates a row of data in a .csv file after each request is (asynchronously) satisfied. It's impossible to see which request is satisfied last, because the request is repeated if no data was extracted yet (occasionally it misses the data a few times). Even though I start with a neat list, the output is jumbled because the rows are written immediately after data is extracted.
Now I'd like to sort that list based on one column, but after every request is done. Can the 'spider_closed' signal be used to trigger a real function? As below, I tried connecting the signal with dispatcher, but this function seems to only print out things, rather than work with variables or even call other functions.
def start_requests(self):
    ...
    dispatcher.connect(self.spider_closed, signal=signals.engine_stopped)
    ...

def spider_closed(spider):
    print('this gets printed alright')  # <- only if the next line is omitted...
    out = self.AnotherFunction(in)      # <- this doesn't seem to run
I hacked together a pipeline to solve this problem for you.
file: Project.middleware_module.SortedCSVPipeline
import csv

from scrapy import signals

class SortedCSVPipeline(object):

    def __init__(self):
        self.items = []
        self.file_name = r'YOUR_FILE_PATH_HERE'
        self.key = 'YOUR_KEY_HERE'

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_closed(self, spider):
        for item in sorted(self.items, key=lambda k: k[self.key]):
            self.write_to_csv(item)

    def process_item(self, item, spider):
        self.items.append(item)
        return item

    def write_to_csv(self, item):
        writer = csv.writer(open(self.file_name, 'a'), lineterminator='\n')
        writer.writerow([item[key] for key in item.keys()])
file: settings.py
ITEM_PIPELINES = {"Project.middleware_module.SortedCSVPipeline.SortedCSVPipeline" : 1000}
When running this you won't need to use an item exporter anymore because this pipeline will do the csv writing for you. Also, the 1000 in the pipeline entry in your setting needs to be a higher value than all other pipelines that you want to run before this one. I tested this in my project and it resulted in a csv file sorted by the column I specified! HTH
Cheers
Is it possible to access the name of the current spider in a feed exporter?
The doc about storage URI parameters might help.
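For example, a feed URI can include the spider name via a storage URI parameter (a sketch using the legacy FEED_URI/FEED_FORMAT settings; the path is a placeholder):

# settings.py
FEED_URI = 'exports/%(name)s/%(time)s.json'   # %(name)s -> spider name, %(time)s -> timestamp
FEED_FORMAT = 'json'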
Or, if you are building your own:
The relevant methods receive the spider object, so you can read its name there.
For example:
def open_spider(self, spider):
    print(spider.name)

def close_spider(self, spider):
    print(spider.name)

def item_scraped(self, item, spider):
    print(spider.name)