I'm asking about the Scrapy framework.
I'm scraping a business page. First I add a Brand item (I have a Brands table) with the business name, then I want to add several business locations (BusinessLocations table), but I need the database BrandId to insert a business location into the database. Then I add a few records about departments for each business location, and again I need the database BusinessLocationId to insert each Department.
Let's assume I insert items into the database in a pipeline.
Can I simply assume that items processed earlier have already left the pipeline and are in the database? In that case I could simply select the needed Ids from the database using some unique text field passed via meta data.
However, I suppose there might be a race condition, since Scrapy processes multiple requests simultaneously. By race condition I mean that a BusinessLocation item is added before the corresponding Brand is inserted into the database. Is there a risk of that kind of race condition?
Can I simply assume that items processed earlier have already left the pipeline and are in the database?
Generally, no.
It depends heavily on what you do in the pipeline. For example, if you use the images pipeline, items with images will be held by that pipeline until all their images are retrieved, while an item with no images (or very few) will pass to the next pipeline before the earlier item does.
You could collect the sub-items in the main item object by passing the item along to the sub-requests, but then you have to handle errors carefully so that you don't silently lose an incomplete item. Another approach is to store the items in a staging database and consolidate them later, looking for orphan records.
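A minimal sketch of the first approach, assuming a hypothetical spider where each Brand page links to its location pages; the URLs, selectors, and field names are made up:

import scrapy


class BrandSpider(scrapy.Spider):
    name = "brands"
    start_urls = ["http://example.com/some-brand"]

    def parse(self, response):
        # Build the top-level Brand item first and carry it through sub-requests.
        brand = {"name": response.css("h1::text").get(), "locations": []}
        location_urls = response.css("a.location::attr(href)").getall()
        if not location_urls:
            yield brand
            return
        brand["_pending"] = len(location_urls)
        for url in location_urls:
            yield response.follow(url, callback=self.parse_location,
                                  meta={"brand": brand})

    def parse_location(self, response):
        brand = response.meta["brand"]
        brand["locations"].append({"address": response.css(".address::text").get()})
        brand["_pending"] -= 1
        if brand["_pending"] == 0:
            # All sub-requests finished: yield the complete Brand once.
            # A failed sub-request would need an errback, otherwise the item
            # is never yielded -- that is the error handling mentioned above.
            yield brand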
I found a solution for waiting until all data are scraped: the close_spider method of a pipeline is called after the spider closes.
class BlsPipeline100(object):

    def __init__(self):
        self.items = []

    def process_item(self, item, spider):
        # Just collect the items; nothing is written to the database yet.
        self.items.append(item)
        return item

    def close_spider(self, spider):
        # Every item has been scraped by now, so the whole
        # Brand -> BusinessLocation -> Department hierarchy can be inserted at once.
        processAllItems(self.items)
Now I can build the whole hierarchy with all items accessible.
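The processAllItems helper is not shown above; here is a rough sketch of what it might do, using sqlite3 only to keep the example self-contained and assuming each collected item is a dict-like object with hypothetical fields (type, name, brand_name, address). Parents are inserted first so their generated ids can be used for the children.

import sqlite3


def processAllItems(items):
    conn = sqlite3.connect("business.db")
    cur = conn.cursor()

    # 1. Insert Brands and remember their generated ids by name.
    brand_ids = {}
    for item in (i for i in items if i["type"] == "brand"):
        cur.execute("INSERT INTO Brands (Name) VALUES (?)", (item["name"],))
        brand_ids[item["name"]] = cur.lastrowid

    # 2. Insert BusinessLocations, resolving BrandId via the brand name.
    location_ids = {}
    for item in (i for i in items if i["type"] == "location"):
        cur.execute(
            "INSERT INTO BusinessLocations (BrandId, Address) VALUES (?, ?)",
            (brand_ids[item["brand_name"]], item["address"]),
        )
        location_ids[(item["brand_name"], item["address"])] = cur.lastrowid

    # 3. Insert Departments, resolving BusinessLocationId the same way.
    for item in (i for i in items if i["type"] == "department"):
        cur.execute(
            "INSERT INTO Departments (BusinessLocationId, Name) VALUES (?, ?)",
            (location_ids[(item["brand_name"], item["address"])], item["name"]),
        )

    conn.commit()
    conn.close()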
Related
I have done a lot of searching, but have been unable to find a satisfactory answer on the most efficient way to achieve the following.
Say my App contains a list of Products. At the end of every day an external service is called that returns another list of Products from a master data source.
If the list of Products from master data contains any Products not in my App, add the Product to the App.
If the Product in the master data is already in my App, and no changes have been made, do nothing.
If the Product in the master data is already in my App, but some data has changed (the Product's name for instance), update the Product.
If a Product is available in my App, but no longer in the master data source, flag it as "Unavailable" in the App.
At the moment, I loop over each list, scanning the other list for each Product:
For each Product in the master data list, loop through the Products in the App, and update as needed. If no Product was found, then add the Product to the App.
Then, for each Product in the App, loop through the Products in the master data list, and if not found, flag as "Unavailable" in the App.
I'm wondering if there is a more efficient method to achieve this? Or any algorithms or patterns that are relevant here?
In each case the Products are represented by objects in a Python list.
First of all, I'd suggest using dicts with the Product code (or name, or whatever is unique) as the key and the Product object as the value. This should make your loops faster by a factor of at least 100 on a thousand entries.
Then, especially for the second pass, it may be worth converting the keys to sets and looping over the difference, as in:
for code in set(appDict.keys()).difference(masterDict.keys()):
    # update unavailable Product data for appDict[code]
    ...
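Putting the suggestion together, a minimal sketch of the whole sync might look like the function below; the Product attributes (code, name, available) and the returned added/updated lists are assumptions, not part of the original question.

def sync_products(app_products, master_products):
    # Hypothetical Product attributes: code (unique key), name, available.
    app_dict = {p.code: p for p in app_products}
    master_dict = {p.code: p for p in master_products}

    added, updated = [], []
    for code, master in master_dict.items():
        app = app_dict.get(code)
        if app is None:
            added.append(master)           # Product missing from the App: add it
        elif app.name != master.name:      # compare whichever fields matter
            app.name = master.name         # Product changed: update it
            updated.append(app)

    # Products in the App that are no longer in the master data source.
    for code in set(app_dict) - set(master_dict):
        app_dict[code].available = False

    return added, updated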
I already have some working Spiders and code to achieve what I want, but I was looking for advice on how to consolidate things more efficiently for the project I'm working on.
My current process involves:
Within Scrapy: manually create Items using scrapy.Item
Within Scrapy: crawl, outputting each Item row to a JSON Lines (JL) file
Current Pipeline:
#pipelines.py
class MyPipeline(object):
    def process_item(self, item, spider):
        # Make sure every declared field is present on the item, defaulting to None.
        for field in item.fields:
            item.setdefault(field, None)
        return item
Outside Scrapy w/ SQLAlchemy: truncate the incoming table, bulk insert the JL file using Pandas to_sql
Outside Scrapy w/ SQLAlchemy: update the incoming table's row_uuid column (md5 hash of all pertinent columns)
Outside Scrapy w/ SQLAlchemy: upsert (insert...on conflict...where row_uuid is distinct from) the incoming data table into the source data table
Outside Scrapy w/ SQLAlchemy: delete from the source table as necessary (404 errors, etc.)
Ideally, I want to perform all these actions within Scrapy using a proper pipeline. I've seen the dataset package mentioned. Would some combination of open_spider, process_item, and close_spider help? Some questions I have:
Is it possible to populate/define Scrapy Items directly from an existing database table without listing the columns manually?
If you have multiple methods in a Spider (parse, parse_detail, etc) that each have their own Item, would the Pipeline be able to insert to the proper database table?
Is it possible to bulk insert X Items at a time, rather than one Item at a time?
Would the below be a potential approach? I assume other changes would be necessary...
#pipelines.py
class SqlPipeline(object):
    def __init__(self, db_conn):
        # Connect to the DB here?
        pass

    def open_spider(self, spider):
        # Truncate the specific incoming table here?
        pass

    def process_item(self, item, spider):
        # (Bulk) insert item row(s) into the specific incoming table?
        # Where would you define the table for this?
        return item

    def close_spider(self, spider):
        # Update row_uuid for the specific incoming table?
        # Do the upsert and delete rows for the specific source table?
        # Close the DB connection here?
        pass
Thanks for your help!
Scrapy's pipelines exist to do exactly what you are describing.
Answering your questions:
Is it possible to populate/define Scrapy Items directly from an existing database table without listing the columns manually?
I don't quite follow "listing the columns manually". I am going to guess that you have a table in a database with a bunch of columns. Those columns have to be defined in your items because they will be mapped to the DB; otherwise, how would you map each field to a column in the table?
If you have multiple methods in a Spider (parse, parse_detail, etc) that each have their own Item, would the Pipeline be able to insert to the proper database table?
Yes. You can define multiple pipelines (each with its own priority) so that different kinds of Items are processed separately and end up in the right place.
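For reference, the priority is just the integer you assign in ITEM_PIPELINES in settings.py; the pipeline class paths below are hypothetical:

# settings.py -- lower numbers run earlier; the class paths are made up
ITEM_PIPELINES = {
    "myproject.pipelines.CleanItemsPipeline": 300,
    "myproject.pipelines.SqlPipeline": 800,
}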
Is it possible to bulk insert X Items at a time, rather than one Item at a time?
Would the below be a potential approach?
Yes, of course. You have to implement it in your pipeline yourself, and the logic can be different in each one.
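As an illustration only (not the poster's actual code), here is a rough sketch of a pipeline that buffers items per target table and bulk-inserts them with SQLAlchemy; the connection URL, the table names, and the item-class-to-table mapping are all assumptions.

from sqlalchemy import create_engine, MetaData


class SqlPipeline:
    BATCH_SIZE = 500
    # Hypothetical mapping from item class name to incoming table name.
    ITEM_TABLES = {"ProductItem": "incoming_products", "StoreItem": "incoming_stores"}

    def open_spider(self, spider):
        # The URL would normally come from Scrapy settings; hard-coded for brevity.
        self.engine = create_engine("postgresql://user:pass@localhost/mydb")
        self.metadata = MetaData()
        # Reflect the existing tables so their columns need not be listed by hand.
        self.metadata.reflect(bind=self.engine)
        self.buffers = {}  # table name -> list of row dicts
        # A TRUNCATE of the incoming tables could also be issued here.

    def process_item(self, item, spider):
        table_name = self.ITEM_TABLES.get(type(item).__name__, "incoming_items")
        self.buffers.setdefault(table_name, []).append(dict(item))
        if len(self.buffers[table_name]) >= self.BATCH_SIZE:
            self._flush(table_name)
        return item

    def _flush(self, table_name):
        rows = self.buffers.get(table_name) or []
        if rows:
            table = self.metadata.tables[table_name]
            with self.engine.begin() as conn:
                conn.execute(table.insert(), rows)  # one bulk INSERT per batch
            self.buffers[table_name] = []

    def close_spider(self, spider):
        for table_name in list(self.buffers):
            self._flush(table_name)
        # The row_uuid update, the upsert into the source table, and the deletes
        # could be run here as raw SQL before closing the connection.
        self.engine.dispose()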
I want to use the scrapy-elasticsearch pipeline in my scrapy project. In this project I have different items / models. These items are stored in a mysql server. In addition I want to index ONE of these items in an ElasticSearchServer.
In the documentation, however, I only find a way to index all defined items, as in the settings.py example below.
ELASTICSEARCH_INDEX = 'scrapy'
ELASTICSEARCH_TYPE = 'items'
ELASTICSEARCH_UNIQ_KEY = 'url'
As you can see, ELASTICSEARCH_TYPE implies that all items are indexed. Is there a way to limit this to only one item type?
The current implementation does not support sending only some items.
You could create a subclass of the original pipeline and override the process_item method to do what you want.
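Not an official feature of the library, just a sketch of that subclass idea; the import path and class name should be checked against the scrapy-elasticsearch version you have installed, and ProductItem stands in for whichever item class you want indexed.

# Verify the module path/class name against your installed scrapy-elasticsearch.
from scrapyelasticsearch.scrapyelasticsearch import ElasticSearchPipeline

from myproject.items import ProductItem  # hypothetical: the one item type to index


class SingleItemElasticSearchPipeline(ElasticSearchPipeline):
    def process_item(self, item, spider):
        # Forward only the chosen item type to Elasticsearch;
        # every other item just passes through unchanged.
        if isinstance(item, ProductItem):
            return super().process_item(item, spider)
        return item

You would then point ITEM_PIPELINES at this subclass instead of the original pipeline class.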
If you have the time, you could also send a pull request upstream with a proposal to allow filtering items before sending them to Elasticsearch.
I made a Scrapy crawler that collects some data from forum threads. On the list page, I can see the last-modified date.
Based on that date, I want to decide whether to crawl the thread again or not. I store the data in MySQL, using a pipeline. While processing the list page with my CrawlSpider, I want to check a record in MySQL, and based on that record either yield a Request or not. (I do NOT want to load the URL unless there is a new post.)
What's the best way to do this?
Use a CrawlSpider Rule:
Rule(SgmlLinkExtractor(), follow=True, process_request='check_moddate'),
Then in your spider:
def check_moddate(self, request):
    def dateisnew():
        # look up the thread's record in MySQL and compare the dates
        ...

    if dateisnew():
        return request
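The date check itself is not shown in the answer; as an assumption-laden sketch, the lookup against MySQL could be a small helper like this (pymysql and the threads table are both hypothetical choices):

import pymysql  # assumption: any DB-API driver would do


def thread_needs_crawl(connection, thread_url, last_modified):
    """Return True if the thread is unknown or has a newer last-modified date.

    Assumes a hypothetical table threads(url, last_modified).
    """
    with connection.cursor() as cursor:
        cursor.execute(
            "SELECT last_modified FROM threads WHERE url = %s", (thread_url,)
        )
        row = cursor.fetchone()
    if row is None:
        return True                   # never crawled this thread before
    return last_modified > row[0]     # only recrawl if there is a newer post

The last-modified date from the list page would need to be attached to the request (for example via request.meta) when the link is extracted, so that check_moddate can pass it to a helper like this.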
I will be using scrapy to crawl a domain. I plan to store all that information into my db with sqlalchemy. It's pretty simple xpath selectors per page, and I plan to use HttpCacheMiddleware.
In theory, I can just insert data into my db as soon as I have data from the spiders (this requires hxs to be instantiated at least). This will allow me to bypass instantiating any Item subclasses so there won't be any items to go through my pipelines.
I see the advantages of doing so as:
Less CPU intensive since there won't be any CPU processing for the pipelines
Prevents memory leaks.
Disk I/O is a lot faster than Network I/O so I don't think this will impact the spiders a lot.
Is there a reason why I would want to use Scrapy's Item class?
If you insert directly inside a spider, then your spider will block until the data is inserted. If you create an Item and pass it to the Pipeline, the spider can continue to crawl while the data is inserted. Also, there might be race conditions if multiple spiders try to insert data at the same time.
This is an old question, but I feel the upvoted answer isn't really correct.
Is there a reason why I would want to use Scrapy's Item class?
The Scrapy model of web scraping is essentially:
1. Collecting data with spiders.
2. Bundling that data into items.
3. Processing those items with item pipelines.
4. Storing those items somewhere with yet another item pipeline.
Steps 3 and 4 comprise the "big" item pipeline. If you don't subclass Item, you cannot enter an object into the item pipeline, so you're forced to normalize the fields and insert the item into your database all within your spider.
If you do subclass Item, you can make your item processing code much more maintainable:
from scrapy.item import Item, Field
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import MapCompose, Identity


class Product(Item):
    name = Field()
    price = Field()
    aisle = Field()
    categories = Field()


class ProductLoader(XPathItemLoader):
    default_item_class = Product
    # parse_price is assumed to be a helper that turns the scraped price text
    # into a number (a possible implementation is sketched below).
    price_in = MapCompose(parse_price)
    categories_out = Identity()
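For completeness, a hedged usage sketch with the same old-style import paths as above; the spider, the XPath expressions, and this parse_price implementation are all hypothetical stand-ins.

from scrapy.spider import BaseSpider  # old-style import, matching the snippet above


def parse_price(value):
    # One possible implementation of the assumed parse_price helper.
    return float(value.replace("$", "").replace(",", ""))


class ProductSpider(BaseSpider):
    name = "products"
    start_urls = ["http://example.com/products/123"]

    def parse(self, response):
        loader = ProductLoader(response=response)
        loader.add_xpath("name", "//h1/text()")
        loader.add_xpath("price", "//span[@class='price']/text()")
        loader.add_xpath("aisle", "//span[@class='aisle']/text()")
        loader.add_xpath("categories", "//ul[@class='breadcrumbs']/li/text()")
        # load_item() runs the input/output processors and returns a Product item.
        return loader.load_item()

This keeps the field normalization in the loader and the pipelines, so the spider only has to know where the data lives on the page.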