Scrapy: why bother with Items when you can just directly insert? - python

I will be using Scrapy to crawl a domain. I plan to store all that information in my db with SQLAlchemy. The XPath selectors per page are pretty simple, and I plan to use HttpCacheMiddleware.
In theory, I can just insert data into my db as soon as I have data from the spiders (this requires hxs to be instantiated at least). This would let me bypass instantiating any Item subclasses, so there won't be any items going through my pipelines.
I see the advantages of doing so as:
Less CPU intensive since there won't be any CPU processing for the pipelines
Prevents memory leaks.
Disk I/O is a lot faster than Network I/O so I don't think this will impact the spiders a lot.
Is there a reason why I would want to use Scrapy's Item class?

If you insert directly inside a spider, then your spider will block until the data is inserted. If you create an Item and pass it to the Pipeline, the spider can continue to crawl while the data is inserted. Also, there might be race conditions if multiple spiders try to insert data at the same time.
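For the non-blocking behaviour described here, the pipeline itself has to perform the insert asynchronously, e.g. by returning a Deferred. A minimal sketch using Twisted's adbapi (the database credentials, table and field names are placeholders, not taken from the question):

from twisted.enterprise import adbapi

class MySQLStorePipeline(object):
    def __init__(self):
        # ConnectionPool runs queries in a thread pool, so inserts do not
        # block the reactor loop that drives the crawl.
        self.dbpool = adbapi.ConnectionPool(
            'MySQLdb', db='scraping', user='root', passwd='secret',
            charset='utf8', use_unicode=True)

    def process_item(self, item, spider):
        # runInteraction returns a Deferred; returning it lets Scrapy keep
        # crawling while the insert completes in the background.
        d = self.dbpool.runInteraction(self._insert, item)
        d.addErrback(lambda failure: spider.logger.error(failure))
        d.addCallback(lambda _: item)
        return d

    def _insert(self, tx, item):
        tx.execute("INSERT INTO pages (url, title) VALUES (%s, %s)",
                   (item['url'], item['title']))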

This is an old question, but I feel the upvoted answer isn't really correct.
Is there a reason why I would want to use Scrapy's Item class?
The Scrapy model of web scraping is essentially:
Collecting data with spiders.
Bundling that data into items.
Processing those items with item pipelines.
Storing those items somewhere with yet another item pipeline.
Steps 3 and 4 comprise the "big" item pipeline. If you don't subclass Item, you cannot enter an object into the item pipeline, so you're forced to normalize the fields and insert the item into your database all within your spider.
If you do subclass Item, you can make your item processing code much more maintainable:
from scrapy.item import Item, Field
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import MapCompose, Identity


class Product(Item):
    name = Field()
    price = Field()
    aisle = Field()
    categories = Field()


class ProductLoader(XPathItemLoader):
    default_item_class = Product
    price_in = MapCompose(parse_price)
    categories_out = Identity()
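For completeness, a hedged sketch of how ProductLoader might be used inside a spider callback. The XPath expressions are placeholders, and parse_price is assumed to be a user-defined price-normalisation function, as in the loader above:

def parse(self, response):
    loader = ProductLoader(response=response)
    # shared extraction and normalisation logic lives in the loader,
    # so the callback stays short
    loader.add_xpath('name', '//h1[@class="product-name"]/text()')
    loader.add_xpath('price', '//span[@class="price"]/text()')
    loader.add_xpath('aisle', '//span[@class="aisle"]/text()')
    loader.add_xpath('categories', '//ul[@class="breadcrumbs"]/li/text()')
    return loader.load_item()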

Related

In which file/place should Scrapy process the data?

Scrapy has several points/places where scraped data can be processed: spiders, items and spider middlewares. But I don't understand where I should do it. I can process scraped data in all of these places. Could you explain the differences between them in detail?
For example: a downloader middleware returns some data to the spider (a number, a short string, a URL, a lot of HTML, a list, and so on). What should I do with it, and where? I understand what to do, but it is not clear where to do it...
Spiders are the main point where you define how to extract data, as items. When in doubt, implement your extraction logic in your spider only, and forget about the other Scrapy features.
Item loaders, item pipelines, downloader middlewares, spider middlewares and extensions are used mainly for code sharing in scraping projects that have several spiders.
If you ever find yourself repeating the same code in two or more spiders, and you decide to stop repeating yourself, then you should look into those components and choose which ones to use to simplify your codebase by moving existing, duplicated code into one or more components of these types.
It is generally a better approach than simply using class inheritance between Spider subclasses.
As to how to use each component:
Item loaders are for shared extraction logic (e.g. XPath and CSS selectors, regular expressions), as well as pre- and post-processing of field values.
For example:
If you were writing spiders for websites that use some standard way of tagging the data to extract, like schema.org, you could write extraction logic on an item loader and reuse it across spiders.
If you always want to uppercase the value of an item field, you would use an output processor on the item loader class, and reuse that item loader across spiders (as sketched below).
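A minimal sketch of that uppercase example, assuming a recent Scrapy where the processors live in the itemloaders package (the loader and field names are illustrative):

from itemloaders.processors import Compose, TakeFirst
from scrapy.loader import ItemLoader


class ArticleLoader(ItemLoader):
    default_output_processor = TakeFirst()
    # for the hypothetical 'headline' field: take the first extracted
    # value, then uppercase it on output
    headline_out = Compose(TakeFirst(), str.upper)

Any spider that builds its items with ArticleLoader(response=response) then gets the same normalisation for free.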
Item pipelines are for post-processing of whole items (not just the data of a specific field).
Common use cases include dropping duplicate items (by keeping track of uniquely-identifying data of every item parsed, as in the sketch below) or sending items to database servers or other forms of storage (as a flexible alternative to feed exports).
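A minimal sketch of the duplicate-dropping case, assuming each item has a 'url' field that uniquely identifies it:

from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):
    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        # drop any item whose unique key has already been seen in this crawl
        if item['url'] in self.seen:
            raise DropItem("Duplicate item found: %s" % item['url'])
        self.seen.add(item['url'])
        return item

Like any pipeline, it would be enabled through ITEM_PIPELINES in the project settings.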
Downloader middlewares are used for shared logic regarding the handling of requests and responses.
Common use cases include detecting and handling anti-bot measures, or proxy handling, as sketched below. (built-in downloader middlewares)
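A minimal sketch of the proxy-handling case: a downloader middleware that tags every outgoing request with a proxy (the proxy URL is a placeholder):

class CustomProxyMiddleware(object):
    def process_request(self, request, spider):
        # the built-in HttpProxyMiddleware picks this meta key up downstream
        request.meta['proxy'] = 'http://proxy.example.com:8080'
        return None  # continue normal processing of the request

It would still need to be enabled via DOWNLOADER_MIDDLEWARES in settings.py.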
Spider middlewares are used for any other shared logic between spiders. They are the closest thing there is to a spider base class. They can handle exceptions from spiders, the initial requests, etc. (built-in spider middlewares)
Extensions are used for more general changes to Scrapy itself. (built-in extensions)
I will try to explain them in order:
Spider is where you decide which URLs to make requests to.
DownloaderMiddleware has a process_request method which is called before a request to a URL is made, and a process_response method which is called once the response from that URL is received.
Pipeline is where the data is sent when you yield a dictionary from your Spider.

Scrapy - Drop item field in pipeline?

So I have an item['html'] field that is needed for MyExamplePipeline, but after processing it isn't needed for storing into a database with, e.g., MongoDBPipeline. Is there a way in Scrapy to just drop the html field and keep the rest of the item? It's needed as part of the item to pass the page HTML from the spider to the pipeline, but I'm not able to figure out how to drop it. I looked at this SO post that mentioned using FEED_EXPORT_FIELDS or fields_to_export, but the problem is that I don't want to use an item exporter, I just want to feed the item into the next MongoDBPipeline. Is there a way to do this in Scrapy? Thanks!
You can easily do that. In your MongoDBPipeline you need to do something like the following:
del item['html']
If that impacts the item in another pipeline, then use copy.deepcopy to create a copy of the item object and delete html from the copy before inserting into MongoDB (see the sketch below).
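A hedged sketch of that suggestion; it assumes self.collection is a pymongo collection set up elsewhere (e.g. in open_spider), which is not shown here:

import copy

class MongoDBPipeline(object):
    def process_item(self, item, spider):
        # copy first, so pipelines running after this one still see 'html'
        doc = copy.deepcopy(dict(item))
        doc.pop('html', None)  # drop the field we don't want to persist
        self.collection.insert_one(doc)
        return item  # the original, untouched item continues downstream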

Scrapy | Automation and Writing to Excel

I have been searching for how to automate Scrapy and write its output to Excel (CSV). So far, the only workable approach is the tedious, manual method of:
scrapy crawl myscript -o myscript.csv -t csv
I want to be able to format each of these into a more collected "row" format. Furthermore, is there any way I can make the scraper automated? Ideally, I want the code to run once per day, and I want to be able to notify myself when there has been an update regarding my scrape (an update being a relevant new post).
My spider is working, and here is the code:
import scrapy
from scrapy.spiders import XMLFeedSpider
from YahooScrape.items import YahooScrapeItem


class Spider(XMLFeedSpider):
    name = "Test"
    allowed_domains = ["yahoo.com"]
    start_urls = ('https://feeds.finance.yahoo.com/rss/2.0/headline?s=GOOGL',)
    itertag = 'item'

    def parse_node(self, response, node):
        item = {}
        item['title'] = node.xpath('title/text()',).extract_first()
        item['pubDate'] = node.xpath('link/pubDate/text()').extract_first()
        item['link'] = node.xpath('link/text()').extract_first()
        item['description'] = node.xpath('description/text()').extract_first()
        return item
I am aware that to further export/organize my scraped output, I have to edit the pipeline settings (at least according to most of the articles I have read).
Below is my pipelines.py code:
class YahooscrapePipeline(object):
    def process_item(self, item, spider):
        return item
How can I set it up so that I can just execute the code and it will automatically write the output?
Update: I am using Scrapinghub's API, which uses the shub module to host my spider. It is very convenient and easy to use.
Scrapy itself does not handle periodic execution or scheduling; it is completely out of Scrapy's scope. I'm afraid the answer will not be as simple as you want, but it is what's needed.
What you CAN do is:
Use celerybeat to allow scheduling based on a crontab schedule. Running Celery tasks periodically (without Django) and http://docs.celeryproject.org/en/latest/userguide/periodic-tasks.html should get you started.
The other thing I suggest is that you host your spider in scrapyd. That will buy you log retention and a nice JSON API to use when you get more advanced. :)
The Stack Overflow link gives you sample code for running Celery without Django (as a lot of examples assume Django :) ). Remember to run the beat scheduler and not the task directly, as pointed out in the link; a sketch follows.
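A hedged sketch of that celerybeat setup; the module name, broker URL and schedule are placeholders, and the task simply shells out to the scrapy CLI from the project directory:

import subprocess
from celery import Celery
from celery.schedules import crontab

app = Celery('yahoo_tasks', broker='redis://localhost:6379/0')

@app.task
def run_spider():
    # equivalent to running `scrapy crawl Test -o output.csv -t csv` by hand;
    # must be executed from within the Scrapy project directory
    subprocess.run(['scrapy', 'crawl', 'Test', '-o', 'output.csv', '-t', 'csv'])

app.conf.beat_schedule = {
    'daily-yahoo-crawl': {
        'task': 'yahoo_tasks.run_spider',
        'schedule': crontab(hour=6, minute=0),  # once per day at 06:00
    },
}

Started with something like celery -A yahoo_tasks worker --beat, it is the beat scheduler (not the task itself) that triggers the daily run.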
As to your question about organizing the output: you mentioned that you are familiar with exporters, so one option is to create a custom CSV item exporter and register the fields to export in your settings. The order in which they appear in your settings is the order in which they will be written to the CSV file.
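If a full custom exporter is more than you need, the FEED_EXPORT_FIELDS setting alone controls which fields are exported and in what column order; a sketch using the field names from the spider above:

# settings.py -- columns are written to the CSV in exactly this order
FEED_EXPORT_FIELDS = ['title', 'pubDate', 'link', 'description']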
If I misunderstood this part of the question and you mean vertical rather than horizontal alignment of your items, and you don't have many fields, a quick hack is to insert newline (\n) or tab (\t) characters into the item values in your spider. It is such a hacky thing to do that I'll spare you the details.
As to scheduling a spider:
As mentioned, there is Scrapyd, which I use together with scrapymon... But be warned: as of this moment Scrapyd has some compatibility issues, so please remember to force yourself to create a virtual environment for your scrapyd projects.
There's a huge learning curve to getting scrapyd to work the way you want it to.
Using Django with Celery is by far the TOP solution when your scraping gets serious, but the learning curve is much higher: now you have to deal with server stuff, it's even more of a pain when it's not a local server, and there's the custom integration or alteration of a web-based GUI. If you don't want to mess with all that, do what I did for a long time and use Scrapinghub: get acquainted with their API (you can curl it or use the Python modules they provide) and cron-schedule your spiders as you see fit right from your PC. The scraping is done remotely, so you keep resource power.

scrapy - How to insert hierarchical items to database?

I'm asking about the Scrapy framework.
I'm scraping a business page. First I add a Brand item (I have a Brands table) with the business name, then I want to add several business locations (BusinessLocations table), but I need the database BrandId to insert a business location into the database. Then I add a few records about departments for each business location, and again I need the database BusinessLocationId to insert each Department.
Let's assume I insert Items into the database in a pipeline.
Can I simply assume that items processed earlier have already left the pipeline and are in the database? In that case I can simply select the needed IDs from the database using some unique text field passed via meta data.
However, I suppose there might be a race condition, since Scrapy processes multiple requests simultaneously. By race condition I mean that a BusinessLocation item is added before the appropriate Brand is inserted into the database. Is there a risk of that kind of race condition?
Can I simply assume that items processed earlier already left pipeline and are in database?
Generally, no.
It depends highly on what you do in the pipeline. For example, if you use the images pipeline, then items with images will be held by the images pipeline until all the images are retrieved, while an item with no images (or very few) will pass to the next pipeline before the previous item.
You could collect the sub-items in the main item object by passing the item around to the sub-requests (see the sketch below), but then you will have to take care to handle errors so as not to lose an incomplete item. Another approach is storing the items in a staging database and consolidating later by looking for orphan records.
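A hedged sketch of the first approach, carrying the partially built brand item through follow-up requests so the locations end up nested inside it (field names, XPaths and the single level of nesting are illustrative, not from the question):

def parse_brand(self, response):
    brand = {'name': response.xpath('//h1/text()').get(), 'locations': []}
    location_urls = response.xpath('//a[@class="location"]/@href').getall()
    if not location_urls:
        yield brand
        return
    # hand the partial item to the first location request
    yield response.follow(location_urls[0], self.parse_location,
                          meta={'brand': brand, 'pending': location_urls[1:]})

def parse_location(self, response):
    brand = response.meta['brand']
    brand['locations'].append({
        'address': response.xpath('//p[@class="address"]/text()').get(),
    })
    pending = response.meta['pending']
    if pending:
        # keep threading the same brand object through the remaining requests
        yield response.follow(pending[0], self.parse_location,
                              meta={'brand': brand, 'pending': pending[1:]})
    else:
        yield brand  # complete hierarchy, inserted by the pipeline in one go

As the answer notes, if any of those requests fails you lose or truncate the whole brand, so errbacks are worth adding.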
I found a solution for waiting until all data is scraped: the pipeline's close_spider method is called after the spider is closed.
class BlsPipeline100(object):
    def __init__(self):
        self.items = []

    def process_item(self, item, spider):
        self.items.append(item)
        return item

    def close_spider(self, spider):
        processAllItems()
Now I can build the hierarchy with all items accessible.

Scrapy: best way to select urls based on mysql

I made a Scrapy crawler that collects some data from forum threads. On the list page, I can see the last-modified date.
Based on that date, I want to decide whether or not to crawl the thread again. I store the data in MySQL, using a pipeline. While processing the list page with my CrawlSpider, I want to check a record in MySQL, and based on that record either yield a Request or not. (I do NOT want to load the URL unless there is a new post.)
What's the best way to do this?
Use a CrawlSpider Rule:
Rule(SgmlLinkExtractor(), follow=True, process_request='check_moddate'),
Then in your spider:
def check_moddate(self, request):
    def dateisnew():
        # check the date
        ...

    if dateisnew():
        return request
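A hedged sketch of what dateisnew might actually do, assuming a MySQLdb connection stored on the spider as self.db and a threads table with url and last_post columns; it also assumes the listing-page date was stashed in request.meta earlier, which is not shown here:

def check_moddate(self, request):
    def dateisnew():
        cursor = self.db.cursor()  # e.g. a MySQLdb connection opened in __init__
        cursor.execute("SELECT last_post FROM threads WHERE url = %s",
                       (request.url,))
        row = cursor.fetchone()
        listed = request.meta.get('listed_date')
        # crawl if the thread was never seen, or the listing shows a newer post
        return row is None or (listed is not None and listed > row[0])

    if dateisnew():
        return request
    # returning None filters the request out, so the thread page is never fetched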
