I have two spiders and run them like this:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
settings = get_project_settings()
process1 = CrawlerProcess(settings)
process1.crawl('spider1')
process1.crawl('spider2')
process1.start()
and I want these spiders to write to a common file. This is the pipeline class:
import codecs
import json
from collections import OrderedDict

class FilePipeline(object):

    def __init__(self):
        self.file = codecs.open('data.txt', 'w', encoding='utf-8')
        self.spiders = []

    def open_spider(self, spider):
        self.spiders.append(spider.name)

    def process_item(self, item, spider):
        line = json.dumps(OrderedDict(item), ensure_ascii=False, sort_keys=False) + "\n"
        self.file.write(line)
        return item

    def spider_closed(self, spider):
        self.spiders.remove(spider.name)
        if len(self.spiders) == 0:
            self.file.close()
But although I don't get an error message, when all the spiders are done writing to the common file I have fewer lines (items) than the Scrapy log reports. A few lines are cut off. Is there a recommended practice for writing to one file simultaneously from two spiders?
UPDATE:
Thanks, everybody!)
I implemented it this way:
import codecs
import json
import threading
from collections import OrderedDict

# VehicleItem is the project's own item class (imported from its items module)

class FilePipeline1(object):

    lock = threading.Lock()
    datafile = codecs.open('myfile.txt', 'w', encoding='utf-8')

    def __init__(self):
        pass

    def open_spider(self, spider):
        pass

    def process_item(self, item, spider):
        line = json.dumps(OrderedDict(item), ensure_ascii=False, sort_keys=False) + "\n"
        try:
            FilePipeline1.lock.acquire()
            if isinstance(item, VehicleItem):
                FilePipeline1.datafile.write(line)
        except:
            pass
        finally:
            FilePipeline1.lock.release()
        return item

    def spider_closed(self, spider):
        pass
I agree with A. Abramov's answer.
Here is just an idea I had. You could create two tables in a DB of your choice and then merge them after both spiders are done crawling. You would have to keep track of the time the logs came in so you can order your logs based on time received. You could then dump the db into whatever file type you would like. This way, the program doesn't have to wait for one process to complete before writing to the file and you don't have to do any multithreaded programming.
UPDATE:
Actually, depending on how long your spiders run, you could just store the log output and the time in a dictionary, where the times are the keys and the log output is the value. That would be easier than setting up a DB. You could then dump the dict into your file ordered by key, as sketched below.
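A rough sketch of that idea (the names record, dump_sorted and merged.txt are made up for illustration):
import json
import time

merged = {}  # arrival time -> item, shared by both spiders' pipelines

def record(item):
    # call this from process_item as each item comes in
    merged[time.time()] = dict(item)

def dump_sorted(path='merged.txt'):
    # call this once, after both spiders have finished
    with open(path, 'w') as f:
        for ts in sorted(merged):
            f.write(json.dumps(merged[ts]) + '\n')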
Both of your spiders, running in separate threads, write to the file simultaneously. That will lead to problems such as lines being cut off and some of them going missing if you don't take care of synchronization, as you describe. To handle it, you need to either synchronize file access and write only whole records/lines, or have a strategy for allocating regions of the file to different threads (e.g. re-building a file with known offsets and sizes), and by default you have neither. In general, writing to the same file at the same time from two different threads is not a common method, and unless you really know what you're doing, I don't advise you to do so.
Instead, I'd separate the spiders' IO functions and wait for one's action to finish before starting the other. Considering your threads aren't synchronized, this will both make the program more efficient and make it work :) If you want a code example of how to do this in your context, just ask for it and I'll happily provide it.
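For example, here is a minimal sketch of the sequential approach (it follows the sequential-crawling pattern from the Scrapy docs and reuses the spider names from the question), so only one spider writes to data.txt at any time:
from twisted.internet import defer, reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

runner = CrawlerRunner(get_project_settings())

@defer.inlineCallbacks
def crawl_sequentially():
    yield runner.crawl('spider1')  # wait until spider1 has completely finished...
    yield runner.crawl('spider2')  # ...before spider2 starts writing to the file
    reactor.stop()

crawl_sequentially()
reactor.run()  # the script blocks here until both crawls are done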
Related
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item
The above code is from the official Scrapy documentation (http://doc.scrapy.org/en/latest/topics/item-pipeline.html), where it is used for filtering duplicates.
And as the Scrapy documentation suggests (http://doc.scrapy.org/en/latest/topics/jobs.html),
to pause and resume a spider I need to use the Jobs system.
So I'm curious whether the Scrapy Jobs system can make the duplicates filter persistent in its directory. The way the duplicates filter is implemented is so simple that I'm in doubt.
You just need to implement your pipeline so that it reads the JOBDIR setting and, when that setting is defined, your pipeline (see the sketch after this list):
Reads the initial value of self.ids_seen from some file inside the JOBDIR directory.
At run time, it updates that file as new IDs are added to the set.
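Here is a minimal sketch of that idea (this is not the built-in filter; the class name and the file name ids_seen.txt are made up for illustration):
import os

from scrapy.exceptions import DropItem

class PersistentDuplicatesPipeline(object):

    def __init__(self, jobdir):
        self.ids_seen = set()
        # keep the seen IDs in a file under JOBDIR, if one is configured
        self.ids_file = os.path.join(jobdir, 'ids_seen.txt') if jobdir else None

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.get('JOBDIR'))

    def open_spider(self, spider):
        # read the initial value of ids_seen from the JOBDIR file
        if self.ids_file and os.path.exists(self.ids_file):
            with open(self.ids_file) as f:
                self.ids_seen = {line.strip() for line in f if line.strip()}

    def process_item(self, item, spider):
        item_id = str(item['id'])
        if item_id in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        self.ids_seen.add(item_id)
        if self.ids_file:
            # append as IDs come in, so the set survives a pause or a kill
            with open(self.ids_file, 'a') as f:
                f.write(item_id + '\n')
        return item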
I was trying to use the following function to wait for a crawler to finish and return all the results. However, this function always returns immediately when called while the crawler is still running. What am I missing here? Isn't join() supposed to wait?
from pydispatch import dispatcher
from scrapy import signals
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

def spider_results():
    runner = CrawlerRunner(get_project_settings())
    results = []

    def crawler_results(signal, sender, item, response, spider):
        results.append(item)

    dispatcher.connect(crawler_results, signal=signals.item_passed)
    runner.crawl(QuotesSpider)  # QuotesSpider is defined elsewhere in the project
    runner.join()
    return results
According to the Scrapy docs (the Common Practices section), the CrawlerProcess class is recommended in cases like this.
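For example, a minimal sketch along those lines, keeping the signal handler from the question (QuotesSpider is assumed to be importable from your project); process.start() blocks until the crawl has finished:
from pydispatch import dispatcher
from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def spider_results():
    results = []

    def crawler_results(signal, sender, item, response, spider):
        results.append(item)

    # item_scraped fires for every item that has passed the item pipelines
    dispatcher.connect(crawler_results, signal=signals.item_scraped)

    process = CrawlerProcess(get_project_settings())
    process.crawl(QuotesSpider)
    process.start()  # blocks here until the crawl is finished
    return results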
I have multiple spiders within one scraping program, and I am trying to run them all simultaneously from a script and then dump the contents to a JSON file. When I use the shell on each individual spider and do -o xyz.json, it works fine.
I've attempted to follow this fairly thorough answer here:
How to create custom Scrapy Item Exporter?
but when I run the script I can see it gathering the data in the shell, yet it does not output anything at all.
Below I've copied, in order: the Exporter, the Pipeline, and the Settings.
Exporter:
from scrapy.exporters import JsonItemExporter

class XYZExport(JsonItemExporter):

    def __init__(self, file, **kwargs):
        super().__init__(file)

    def start_exporting(self):
        self.file.write(b)

    def finish_exporting(self):
        self.file.write(b)
I'm struggling to determine what goes in the self.file.write parentheses?
Pipeline:
from exporters import XYZExport

class XYZExport(object):

    def __init__(self, file_name):
        self.file_name = file_name
        self.file_handle = None

    @classmethod
    def from_crawler(cls, crawler):
        output_file_name = crawler.settings.get('FILE_NAME')
        return cls(output_file_name)

    def open_spider(self, spider):
        print('Custom export opened')
        file = open(self.file_name, 'wb')
        self.file_handle = file
        self.exporter = XYZExport(file)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        print('Custom Exporter closed')
        self.exporter.finish_exporting()
        self.file_handle.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
Settings:
FILE_NAME = 'C:\Apps Ive Built\WebScrape Python\XYZ\ScrapeOutput.json'
ITEM_PIPELINES = {
'XYZ.pipelines.XYZExport' : 600,
}
I hope (and am afraid) it's a simple omission, because that seems to be my MO, but I'm very new to scraping and this is the first time I've tried to do it this way.
If there is a more stable way to export this data I'm all ears; otherwise, can you tell me what I've missed that is preventing the data from being exported, or preventing the exporter from being properly called?
[Edited to change the pipeline name in settings]
I use MongoDB to store the crawled data.
Now I want to query the latest date in the data, so that I can continue crawling without restarting from the beginning of the URL list (the URLs can be determined by date, like /2014-03-22.html).
I want only one connection object to handle the database operations, and it lives in the pipeline.
So I want to know how I can get hold of that pipeline object (not a new one) in the spider.
Or is there any better solution for incremental updates?
Thanks in advance.
Sorry for my poor English.
Just a sample for now:
# This is my Pipeline
class MongoDBPipeline(object):

    def __init__(self, mongodb_db=None, mongodb_collection=None):
        self.connection = pymongo.Connection(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        ....

    def process_item(self, item, spider):
        ....

    def get_date(self):
        ....
And the spider:
class Spider(Spider):
    name = "test"
    ....

    def parse(self, response):
        # I want to get the Pipeline object here
        mongo = MongoDBPipeline()  # done this way, a new Pipeline object has to be created
        mongo.get_date()           # but Scrapy already has a Pipeline object for this spider
        # I want the Pipeline object that was created when Scrapy started.
OK, I just don't want to create a new object... I admit I'm a bit OCD about it.
A Scrapy Pipeline has an open_spider method that gets executed after the spider is initialized. You can pass a reference to the database connection, the get_date() method, or the Pipeline itself, to your spider. An example of the latter with your code is:
# This is my Pipeline
class MongoDBPipeline(object):

    def __init__(self, mongodb_db=None, mongodb_collection=None):
        self.connection = pymongo.Connection(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        ....

    def process_item(self, item, spider):
        ....

    def get_date(self):
        ....

    def open_spider(self, spider):
        spider.myPipeline = self
Then, in the spider:
class Spider(Spider):
    name = "test"

    def __init__(self):
        self.myPipeline = None

    def parse(self, response):
        self.myPipeline.get_date()
I don't think the __init__() method is necessary here, but I put it here to show that open_spider replaces it after initialization.
According to the Scrapy Architecture Overview:
The Item Pipeline is responsible for processing the items once they have been extracted (or scraped) by the spiders.
Basically that means that the spiders run first, and the extracted items then go to the pipelines; there is no way to go backwards.
One possible solution would be to check, in the pipeline itself, whether the item you've scraped is already in the database.
Another workaround would be to keep the list of URLs you've crawled in the database and, in the spider, check whether you've already got the data for a URL.
Since I'm not sure what you mean by "start from the beginning", I cannot suggest anything specific.
Hope at least this information helped.
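As an illustration of the first workaround, here is a rough sketch of a pipeline that skips items that are already stored in MongoDB (the database, collection and field names are made up for the example):
import pymongo

from scrapy.exceptions import DropItem

class DedupMongoPipeline(object):

    def __init__(self):
        client = pymongo.MongoClient('localhost', 27017)
        self.collection = client['mydb']['items']

    def process_item(self, item, spider):
        # 'url' stands in for whatever field uniquely identifies an item
        if self.collection.find_one({'url': item['url']}) is not None:
            raise DropItem("Already stored: %s" % item['url'])
        self.collection.insert_one(dict(item))
        return item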
I wrote a spider using scrapy, one that makes a whole bunch of HtmlXPathSelector Requests to separate sites. It creates a row of data in a .csv file after each request is (asynchronously) satisfied. It's impossible to see which request is satisfied last, because the request is repeated if no data was extracted yet (occasionally it misses the data a few times). Even though I start with a neat list, the output is jumbled because the rows are written immediately after data is extracted.
Now I'd like to sort that list based on one column, but only after every request is done. Can the 'spider_closed' signal be used to trigger a real function? As shown below, I tried connecting the signal with dispatcher, but this function seems to only print things out, rather than work with variables or even call other functions.
def start_requests(self):
    ...
    dispatcher.connect(self.spider_closed, signal=signals.engine_stopped)
    ...

def spider_closed(spider):
    print 'this gets printed alright'  # <- only if the next line is omitted...
    out = self.AnotherFunction(in)     # <- this doesn't seem to run
I hacked together a pipeline to solve this problem for you.
file: Project.middleware_module.SortedCSVPipeline
import csv

from scrapy import signals

class SortedCSVPipeline(object):

    def __init__(self):
        self.items = []
        self.file_name = r'YOUR_FILE_PATH_HERE'
        self.key = 'YOUR_KEY_HERE'

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_closed(self, spider):
        for item in sorted(self.items, key=lambda k: k[self.key]):
            self.write_to_csv(item)

    def process_item(self, item, spider):
        self.items.append(item)
        return item

    def write_to_csv(self, item):
        writer = csv.writer(open(self.file_name, 'a'), lineterminator='\n')
        writer.writerow([item[key] for key in item.keys()])
file: settings.py
ITEM_PIPELINES = {"Project.middleware_module.SortedCSVPipeline.SortedCSVPipeline" : 1000}
When running this you won't need an item exporter anymore, because this pipeline does the CSV writing for you. Also, the 1000 in the pipeline entry in your settings needs to be higher than the value of every other pipeline that you want to run before this one. I tested this in my project and it produced a CSV file sorted by the column I specified! HTH
Cheers