In which file/place should Scrapy process the data?

Scrapy has several points/places where scraped data can be processed: spiders, items and spider middlewares. But I don't understand where I should do it properly. I can process scraped data in all of these places. Could you explain the differences between them in detail?
For example: a downloader middleware returns some data to the spider (a number, a short string, a URL, a lot of HTML, a list and so on). What should I do with that data, and where? I understand what to do, but it is not clear where to do it...

Spiders are the main point where you define how to extract data, as items. When in doubt, implement your extraction logic in your spider only, and forget about the other Scrapy features.
Item loaders, item pipelines, downloader middlewares, spider middlewares and extensions are used mainly for code sharing in scraping projects that have several spiders.
If you ever find yourself repeating the same code in two or more spiders, and you decide to stop repeating yourself, then you should look into those components and choose which ones to use to simplify your codebase by moving existing, duplicated code into one or more components of these types.
It is generally a better approach than simply using class inheritance between Spider subclasses.
As to how to use each component:
Item loaders are for shared extraction logic (e.g. XPath and CSS selectors, regular expressions), as well as pre- and post-processing of field values.
For example:
If you were writing spiders for websites that use some standard way of tagging the data to extract, like schema.org, you could write extraction logic on an item loader and reuse it across spiders.
If you always want to convert the value of an item field to uppercase, you would use an output processor on the item loader class, and reuse that item loader across spiders.
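A minimal sketch of that second example; the loader, field and selector names are illustrative (on older Scrapy the processors live in scrapy.loader.processors instead of itemloaders.processors):

from itemloaders.processors import Compose, TakeFirst
from scrapy.loader import ItemLoader

class ProductLoader(ItemLoader):
    default_output_processor = TakeFirst()
    # output processor for the "name" field: take the first value, then uppercase it
    name_out = Compose(TakeFirst(), str.upper)

# in any spider callback:
#   loader = ProductLoader(item={}, response=response)
#   loader.add_css('name', 'h1.product-title::text')
#   yield loader.load_item()        # {'name': 'SOME PRODUCT NAME'}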
Item pipelines are for post-processing of whole items (as opposed to the data of specific fields within an item).
Common use cases include dropping duplicate items (by keeping track of uniquely-identifying data of every item parsed) or sending items to database servers or other forms of storage (as a flexible alternative to feed exports).
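For instance, a minimal sketch of the duplicate-dropping case, assuming each item carries a uniquely-identifying 'id' field and that the pipeline is enabled in ITEM_PIPELINES:

from scrapy.exceptions import DropItem

class DuplicatesPipeline:
    def __init__(self):
        self.seen_ids = set()

    def process_item(self, item, spider):
        if item['id'] in self.seen_ids:
            raise DropItem("Duplicate item found: %s" % item['id'])
        self.seen_ids.add(item['id'])
        return item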
Downloader middlewares are used for shared logic regarding the handling of requests or responses.
Common use cases include detecting and handling anti-bot measures, or proxy handling. (built-in downloader middlewares)
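For instance, a minimal sketch of shared proxy handling, to be enabled in DOWNLOADER_MIDDLEWARES; the proxy address is a placeholder:

class ProxyMiddleware:
    def process_request(self, request, spider):
        # routed through Scrapy's built-in HttpProxyMiddleware via request.meta
        request.meta['proxy'] = 'http://proxy.example.com:8080'
        return None     # let the request continue through the downloader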
Spider middlewares are used for any other shared logic between spiders. They are the closest thing to a spider base class that there is. They can handle exceptions from spiders, the initial requests, etc. (built-in spider middlewares)
Extensions are used for more general changes to Scrapy itself. (built-in extensions)

I will try to explain them in order:
The Spider is where you decide which URLs to make requests to.
A Downloader Middleware has a process_request method, which is called before a request to a URL is made, and a process_response method, which is called once the response from that URL is received.
A Pipeline is where the data is sent when you yield a dictionary from your Spider.

Related

How to crawl a website recursively in parallel?

I am trying to parallelize scraping a website using BeautifulSoup in Python.
I give a url and a depth variable to a function and it looks something like this:
def recursive_crawl(url, depth):
    if depth == 0:
        return
    links = fetch_links(url)  # where I use the requests library
    print("Level : ", depth, url)
    for link in links:
        if is_link_valid(link) is True:
            recursive_crawl(link, depth - 1)
The output looks partially like:
Level : 4 site.com
Level : 3 site.com/
Level : 2 site.com/research/
Level : 1 site.com/blog/
Level : 1 site.com/stories/
Level : 1 site.com/list-100/
Level : 1 site.com/cdn-cgi/l/email-protection
and so on.
My problem is this:
I am using a set() to avoid going to already-visited links, so I have a shared-memory problem. Can I implement this recursive web crawler in parallel?
Note: Please don't recommend scrapy, I want it done with a parsing library like BeautifulSoup.
I don't think the parsing library you use matters very much. It seems that what you're asking about is how to manage the threads. Here's my general approach:
Establish a shared queue of URLs to be visited. Although the main contents are URL strings, you probably want to wrap those with some supplemental information: depth, referring links, and any other contextual information your spider's going to want.
Build a gatekeeper object which maintains a list of already-visited URLs. (That set is the one you've mentioned.) The object has a method which takes a URL and decides whether to add it to the Queue in #1. (Submitted URLs are also added to the set. You might also strip the URLs of GET parameters before adding them.) During setup / instantiation, it might take parameters which limit the crawl for other reasons. Depending on the variety of Queue you've chosen, you might also do some prioritization in this object. Since you're using this gatekeeper to wrap the Queue's input, you probably also want to wrap the Queue's output.
Launch several worker Threads. You'll want to make sure each Thread is instantiated with a parameter referencing the single gatekeeper instance. That Thread contains a loop with all your page-scraping code. Each iteration of the loop consumes one URL from the gatekeeper Queue, and submits any discovered URLs to the gatekeeper. (The thread doesn't care whether the URLs are already in the queue. That responsibility belongs to the gatekeeper.)
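A minimal sketch of this arrangement, assuming the fetch_links() and is_link_valid() helpers from the question; everything else is illustrative:

import threading
from queue import Queue

class Gatekeeper:
    """Wraps the work queue and remembers which URLs were already submitted."""
    def __init__(self, max_depth):
        self.queue = Queue()
        self.seen = set()
        self.lock = threading.Lock()
        self.max_depth = max_depth

    def submit(self, url, depth):
        if depth > self.max_depth:
            return
        with self.lock:                    # protect the shared set
            if url in self.seen:
                return
            self.seen.add(url)
        self.queue.put((url, depth))

def worker(gatekeeper):
    while True:
        url, depth = gatekeeper.queue.get()
        try:
            for link in fetch_links(url):          # your existing helpers
                if is_link_valid(link):
                    gatekeeper.submit(link, depth + 1)
        finally:
            gatekeeper.queue.task_done()

gatekeeper = Gatekeeper(max_depth=4)
gatekeeper.submit("https://site.com", 0)
for _ in range(8):                                  # 8 worker threads
    threading.Thread(target=worker, args=(gatekeeper,), daemon=True).start()
gatekeeper.queue.join()                             # block until the queue drains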

Design Question: Best 'place' to parse Scrapy text in Items?

This question is specifically about where, from an architecture/design perspective, the best place is to parse text obtained from a response object in Scrapy.
Context:
I'm learning Python and starting with scraping data from a popular NFL football database site
I've gotten all the data points I need, and have them stored in a local database (sqlite)
One thing I am scraping is a 'play by play', which collects the things that happen in every play. There is a descriptive text field that may say things like "Player XYZ threw a pass to Player ABC" or "Player 123 ran the ball up the middle".
I'd like to take that text field, parse it, and categorize it into general groups such as "Passing Play", "Rushing Play" etc based off certain keyword patterns.
My question is as follows: When and where is the best place to do that? Do I create my own middleware in Scrapy so that by the time it reaches the pipeline the item already has the categories and thus is stored in my database? Or do I just collect the scraped responses 'raw', store directly in my DB and do data cleaning in SQL after the fact, or even via a separate python script?
As mentioned, I'm new to programming as a whole, so I'm not sure what's best from a 'design' perspective.
If you're doing any scraping in Scrapy, you will have to think about which item fields you want to use to collect the data, so figuring out what those fields are before you write your scraper comes first.
I don't necessarily think you need your own middleware unless your data particularly needs work done at the request and response level. The middlewares are mostly useful for processing requests and responses rather than data manipulation/cleaning, i.e. if you have duplicate requests, need to modify responses, add requests, etc.
Scrapy is built for data extraction and already has a robust way of putting that information into a dictionary-like API called ItemAdapter, which is essentially a wrapper for different ways of storing data.
There are also ways to clean data, in small ways and in larger ways, within Scrapy. You can use Item Loaders, which put your field values through small functions that can manipulate the data, or use a pipeline. Pipelines give you lots of flexibility in handling extracted data.
You will have to think about database design and what tables you're going to use, because ultimately that is where you will be putting your data. It's quite easy to set up a database pipeline in Scrapy, and it is flexible enough for you to place data into any table you want using SQL queries.
Familiarising yourself with the Scrapy architecture might help you create a mental model of the process; see the architecture overview in the Scrapy documentation.
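To make the pipeline route concrete, here is a minimal sketch of a keyword-based categoriser, enabled via ITEM_PIPELINES; the "description" field, the "category" field and the keyword lists are assumptions, not taken from your code:

from itemadapter import ItemAdapter

class PlayCategoryPipeline:
    """Tag each play-by-play item with a general category based on keywords."""
    PASSING_KEYWORDS = ("pass", "threw", "sacked", "intercepted")
    RUSHING_KEYWORDS = ("ran", "rush", "up the middle", "scramble")

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        text = (adapter.get("description") or "").lower()
        if any(word in text for word in self.PASSING_KEYWORDS):
            adapter["category"] = "Passing Play"   # needs a declared field if you use Item classes
        elif any(word in text for word in self.RUSHING_KEYWORDS):
            adapter["category"] = "Rushing Play"
        else:
            adapter["category"] = "Other"
        return item

A database pipeline registered after this one in ITEM_PIPELINES would then receive items that already carry the category.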

How to set the same cache folder for different spiders? Currently Scrapy creates a subfolder in the cache dir for each spider

I have spiders running on the same domain; the second spider's run depends on the results of the first spider, and I would like them to share cache information. However, in the cache folder they create subfolders named after the spiders. Is it possible to set the same folder for both of them? Maybe Scrapy has a cache storage that doesn't use different folders for different spiders (and supports compression like 'scrapy.extensions.httpcache.FilesystemCacheStorage')? It looks like the LevelDB and DBM storages also use spider names for some sort of "subfoldering".
Also, if I somehow do so, presumably by removing spider.name from the os.path.join call in httpcache.py for FilesystemCacheStorage (or changing it to the Scrapy project name):
def _get_request_path(self, spider, request):
    key = request_fingerprint(request)
    return os.path.join(self.cachedir, spider.name, key[0:2], key)
wouldn't any meta/spider-specific information prevent them from reusing the cached info?
Long-read version (maybe my approach is bad in general): or maybe I'm doing it all wrong, and for multiple runs over an intersecting set of links from the domain I should consider using a pipeline?
I scrape:
menu_1/subelements_1/subelements_1_2/items_set_1 in spider1
and then
menu_2/subelements_2/subelements_2_2/items_set_2 in spider2,
but items_set_1 overlaps with about 40% of items_set_2 (i.e. they are the same items, with the same universal_item_id), and in that case I don't need those items (the ones already in items_set_1) in items_set_2. I can only find out that I don't need an item in spider2 (because spider1 already has its data) once I finally get the item. So I end up with a folder of 300 MB of gzipped cache data for spider1 plus gzipped cache data for spider2 (and I go: "oh, we already have this universal_item_id in items_set_1, so we don't yield this item in spider2"), of which roughly 40% of the space was downloaded twice, because the responses were cached in different subfolders.
You should try to just subclass scrapy.extensions.httpcache.FilesystemCacheStorage and override _get_request_path to use a single folder (see an example here: https://github.com/scrapy-plugins/scrapy-splash/blob/master/scrapy_splash/cache.py). The default cache request fingerprint does not take meta into consideration, only url/body/method, and I believe headers, but only if configured to do so, not by default.
Don't forget to specify your class in the HTTPCACHE_STORAGE setting.
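A minimal sketch of such a subclass, assuming the (spider, request) signature of _get_request_path shown in the question; the folder name and module path are placeholders:

import os
from scrapy.extensions.httpcache import FilesystemCacheStorage
from scrapy.utils.request import request_fingerprint

class SharedFilesystemCacheStorage(FilesystemCacheStorage):
    """Store cached responses of all spiders under one shared folder."""
    def _get_request_path(self, spider, request):
        key = request_fingerprint(request)
        # 'shared' replaces the per-spider subfolder (spider.name)
        return os.path.join(self.cachedir, 'shared', key[0:2], key)

# settings.py
# HTTPCACHE_STORAGE = 'myproject.httpcache.SharedFilesystemCacheStorage'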

Scrapy | Automation and Writing to Excel

I have been searching for how to automate Scrapy and write files to Excel (CSV). So far, the only doable command is the tedious, manual method of:
scrapy crawl myscript -o myscript.csv -t csv
I want to be able to format each of these into a more collected "row" format. Furthermore, is there any way I can automate the scraper? Ideally, I want the code to run once per day, and I want to be able to notify myself when there has been an update regarding my scrape, with an update being a relevant post.
My spider is working, and here is the code:
import scrapy
from scrapy.spiders import XMLFeedSpider
from YahooScrape.items import YahooScrapeItem

class Spider(XMLFeedSpider):
    name = "Test"
    allowed_domains = ["yahoo.com"]
    start_urls = ('https://feeds.finance.yahoo.com/rss/2.0/headline?s=GOOGL',)
    itertag = 'item'

    def parse_node(self, response, node):
        item = {}
        item['title'] = node.xpath('title/text()').extract_first()
        item['pubDate'] = node.xpath('link/pubDate/text()').extract_first()
        item['link'] = node.xpath('link/text()').extract_first()
        item['description'] = node.xpath('description/text()').extract_first()
        return item
I am aware that to further export/organize my scraper, I have to edit the pipeline settings (at least according to a big majority of articles I have read).
Below is my pipelines.py code:
class YahooscrapePipeline(object):
    def process_item(self, item, spider):
        return item
How can I set it up so I can just execute the code and it will automatically write the output?
Update: I am using Scrapinghub's API, which runs off of the shub module, to host my spider. It is very convenient and easy to use.
Scrapy itself does not handle periodic execution or scheduling. It is completely out of Scrapy's scope. I'm afraid the answer will not be as simple as you want, but it is what's needed.
What you CAN do is:
Use celerybeat to allow scheduling based on a crontab schedule. The Stack Overflow question "Running Celery tasks periodically (without Django)" and http://docs.celeryproject.org/en/latest/userguide/periodic-tasks.html should get you started.
The other thing I suggest is that you host your spider in scrapyd. That will buy you log retention and a nice JSON API to use when you get more advanced. :)
The Stack Overflow link gives you sample code for running Celery without Django (as a lot of examples assume Django). Remember to run the beat scheduler and not the task directly, as pointed out in the link.
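A minimal sketch of that setup, assuming a tasks.py module, a local Redis broker and the "Test" spider from the question; shelling out to the scrapy CLI keeps Twisted's reactor out of the Celery worker process:

import subprocess
from celery import Celery
from celery.schedules import crontab

app = Celery("scrapy_scheduler", broker="redis://localhost:6379/0")

@app.task
def run_spider():
    # run the crawl exactly as you would from the shell
    subprocess.check_call(["scrapy", "crawl", "Test", "-o", "output.csv", "-t", "csv"])

app.conf.beat_schedule = {
    "daily-crawl": {
        "task": "tasks.run_spider",
        "schedule": crontab(hour=6, minute=0),   # once per day at 06:00
    },
}

# run the worker:         celery -A tasks worker --loglevel=info
# run the beat scheduler: celery -A tasks beat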
As for your question about organizing the output of your scrape: since you mentioned that you are familiar with exporters, you can create a custom CSV item exporter and register the fields to export in your settings. The order in which they appear in your settings is the order in which they will be written to the CSV file.
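If a full custom exporter is more than you need, here is a minimal sketch of doing both things from settings.py alone, using the field names from the spider above: FEED_EXPORT_FIELDS fixes the CSV column order, and FEED_URI/FEED_FORMAT let a plain "scrapy crawl Test" write the file without the -o/-t flags:

# settings.py
FEED_FORMAT = 'csv'
FEED_URI = 'yahoo_headlines.csv'                                    # written on every run
FEED_EXPORT_FIELDS = ['pubDate', 'title', 'link', 'description']   # CSV column order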
If I misunderstood that part of the question and you mean vertical rather than horizontal alignment of your items, and you don't have many fields, a quick hack is to add "\n" (newline) or "\t" (tab) characters to the field values as your spider builds the items. You would have to locate the items first and then append the newline or tab to the values you find. I could give an example, but it is such a hacky thing to do that I'll spare you.
As to scheduling a spider...
As mentioned, there is Scrapyd, which I use together with scrapymon... But be warned: as of this moment Scrapyd has some compatibility issues, so please remember to force yourself to create a virtual environment for your scrapyd projects.
There's a big learning curve to getting scrapyd working the way you want it to.
Using Django with Celery is by far the top solution when your scraping gets serious... The learning curve is much higher, because now you have to deal with server stuff (even more of a pain when it's not a local server), plus custom integration or alteration of a web-based GUI. If you don't want to mess with all that, what I did for a long time was use Scrapinghub: get acquainted with their API (you can curl it or use the Python modules they provide) and cron-schedule your spiders as you see fit, right from your PC. The scrape is done remotely, so you keep your resource power.

How to see/edit/avoid duplicates in Scrapy?

I was just wondering how I can reset the dupefilter process to prevent a certain number of URLs from being filtered.
Indeed, I tested a crawler many times before succeeding, and now that I want to run it with something like scrapy crawl quotes -o test_new.csv -s JOBDIR=crawls/quotes_new-1,
it keeps telling me that some URLs are duplicated and are therefore not visited.
It would definitely be OK to remove all URLs from that crawler.
I would appreciate knowing where the duplicate URLs are filtered (then I could edit that?).
Disabling the filter on the requests (dont_filter) is not possible for my problem because it will loop.
I can add my code but as it's a general question I felt it would be more confusing than anything. Just ask if you need it :)
Thank you very much,
You can set Scrapy's DUPEFILTER_CLASS setting to your own dupefilter class, or just extend the default RFPDupeFilter class (source code) with your changes.
This documentation page explains a bit more:
The default (RFPDupeFilter) filters based on request fingerprint using the scrapy.utils.request.request_fingerprint function.
In order to change the way duplicates are checked you could subclass RFPDupeFilter and override its request_fingerprint method. This method should accept scrapy Request object and return its fingerprint (a string).
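A minimal sketch of such a subclass; it overrides request_seen rather than request_fingerprint, since the goal here is to exempt some URLs from filtering. The URL pattern and module path are assumptions:

import re
from scrapy.dupefilters import RFPDupeFilter

class SelectiveDupeFilter(RFPDupeFilter):
    # hypothetical pattern for URLs that should always be recrawled
    ALWAYS_RECRAWL = re.compile(r"/page/\d+$")

    def request_seen(self, request):
        if self.ALWAYS_RECRAWL.search(request.url):
            return False                  # never treat these as duplicates
        return super().request_seen(request)

# settings.py
# DUPEFILTER_CLASS = 'myproject.dupefilters.SelectiveDupeFilter'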
