How to crawl a website with recursion in parallel? - python

I am trying to parallelize scraping a website using BeautifulSoup in Python.
I give a url and a depth variable to a function and it looks something like this:
def recursive_crawl(url, depth):
    if depth == 0:
        return
    links = fetch_links(url)  # where I use the requests library
    print("Level : ", depth, url)
    for link in links:
        if is_link_valid(link):
            recursive_crawl(link, depth - 1)
The output is partially like :
Level : 4 site.com
Level : 3 site.com/
Level : 2 site.com/research/
Level : 1 site.com/blog/
Level : 1 site.com/stories/
Level : 1 site.com/list-100/
Level : 1 site.com/cdn-cgi/l/email-protection
and so on.
My problem is this:
I am using a set() to avoid revisiting already visited links, so I have a shared-memory problem. Can I implement this recursive web crawler in parallel?
Note: please don't recommend Scrapy; I want it done with a parsing library like BeautifulSoup.

I don't think the parsing library you use matters very much. It seems that what you're asking about is how to manage the threads. Here's my general approach:
Establish a shared queue of URLs to be visited. Although the main contents are URL strings, you probably want to wrap those with some supplemental information: depth, referring links, and any other contextual information your spider's going to want.
Build a gatekeeper object which maintains a list of already-visited URLs. (That set is the one you've mentioned.) The object has a method which takes a URL and decides whether to add it to the Queue in #1. (Submitted URLs are also added to the set. You might also strip the URLs of GET parameters before adding them.) During setup / instantiation, it might take parameters which limit the crawl for other reasons. Depending on the variety of Queue you've chosen, you might also do some prioritization in this object. Since you're using this gatekeeper to wrap the Queue's input, you probably also want to wrap the Queue's output.
Launch several worker Threads. You'll want to make sure each Thread is instantiated with a parameter referencing the single gatekeeper instance. That Thread contains a loop with all your page-scraping code. Each iteration of the loop consumes one URL from the gatekeeper Queue, and submits any discovered URLs to the gatekeeper. (The thread doesn't care whether the URLs are already in the queue. That responsibility belongs to the gatekeeper.)
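A minimal sketch of that arrangement, reusing the fetch_links() and is_link_valid() helpers from the question; the Gatekeeper class, the worker count and the depth handling are assumptions, not a definitive implementation:
import queue
import threading

class Gatekeeper:
    """Wraps the work queue and the visited-set behind a lock."""
    def __init__(self, max_depth):
        self.q = queue.Queue()
        self.seen = set()
        self.lock = threading.Lock()
        self.max_depth = max_depth

    def submit(self, url, depth):
        # only the gatekeeper decides whether a URL enters the queue
        with self.lock:
            if url in self.seen or depth > self.max_depth:
                return
            self.seen.add(url)
        self.q.put((url, depth))

    def get(self):
        return self.q.get()

    def done(self):
        self.q.task_done()

    def join(self):
        self.q.join()

def worker(gate):
    while True:
        url, depth = gate.get()
        try:
            print("Level :", depth, url)
            for link in fetch_links(url):         # your existing requests/BeautifulSoup code
                if is_link_valid(link):
                    gate.submit(link, depth + 1)  # depth counts up from 0 here
        finally:
            gate.done()

gate = Gatekeeper(max_depth=4)
gate.submit("https://site.com", 0)
for _ in range(8):                                # eight worker threads, tune as needed
    threading.Thread(target=worker, args=(gate,), daemon=True).start()
gate.join()                                       # blocks until every submitted URL is processed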

Related

In which file/place should Scrapy process the data?

Scrapy has several points/places where scraped data can be processed: spiders, items and spider middlewares. But I don't understand where I should do it. I can process some scraped data in all these places. Could you explain the differences between them in detail?
For example: a downloader middleware returns some data to the spider (a number, a short string, a URL, a lot of HTML, a list and other things). What should I do with them, and where? I understand what to do, but it is not clear where to do it...
Spiders are the main point where you define how to extract data, as items. When in doubt, implement your extraction logic in your spider only, and forget about the other Scrapy features.
Item loaders, item pipelines, downloader middlewares, spider middlewares and extensions are used mainly for code sharing in scraping projects that have several spiders.
If you ever find yourself repeating the same code in two or more spiders, and you decide to stop repeating yourself, then you should look into those components and choose which ones to use to simplify your codebase by moving existing, duplicate code into one or more components of these types.
It is generally a better approach than simply using class inheritance between Spider subclasses.
As to how to use each component:
Item loaders are for shared extraction logic (e.g. XPath and CSS selectors, regular expressions), as well as pre- and post-processing of field values.
For example:
If you were writing spiders for websites that use some standard way of tagging the data to extract, like schema.org, you could write extraction logic on an item loader and reuse it across spiders.
If you always want to convert the value of an item field to uppercase, you would use an output processor on the item loader class, and reuse that item loader across spiders.
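A hedged sketch of that uppercase example; ProductLoader and the name field are made up for illustration, and the processors import path is the one used by Scrapy versions of that era:
from scrapy.loader import ItemLoader
from scrapy.loader.processors import Compose, TakeFirst

class ProductLoader(ItemLoader):
    default_output_processor = TakeFirst()
    # always store the "name" field in uppercase, for every spider that uses this loader
    name_out = Compose(TakeFirst(), str.upper)
Any spider can then instantiate ProductLoader(item=..., response=response) and the post-processing lives in one shared place.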
Item pipelines are for post-processing of whole items (not just the data of a specific field).
Common use cases include dropping duplicate items (by keeping track of uniquely-identifying data of every item parsed) or sending items to database servers or other forms of storage (as a flexible alternative to feed exports).
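For example, a duplicate-dropping pipeline might look roughly like this (using a url field as the unique key is an assumption):
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):
    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        # remember a uniquely-identifying field of every item and drop repeats
        if item['url'] in self.seen_urls:
            raise DropItem("Duplicate item found: %s" % item['url'])
        self.seen_urls.add(item['url'])
        return item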
Downloader middlewares are used for shared logic regarding the handling of requests and responses.
Common use cases include detecting and handling anti-bot software, or proxy handling. (built-in downloader middlewares)
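A rough sketch of the proxy use case (the proxy address is a placeholder):
class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # route every outgoing request through a proxy, shared by all spiders
        request.meta['proxy'] = 'http://proxy.example.com:8080'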
Spider middlewares are used for any other shared logic between spiders. It is the closest thing to a spider base class that there is. It can handle exceptions from spiders, the initial requests, etc. (built-in spider middlewares)
Extensions are used for more general changes to Scrapy itself. (built-in extensions)
I will try to explain them in order:
The spider is where you decide which URLs to make requests to.
A downloader middleware has a process_request method, which is called before a request to a URL is made, and a process_response method, which is called once the response from that URL is received.
A pipeline is where the data is sent when you yield a dictionary from your spider.

Scrapy | Automation and Writing to Excel

I have been searching for how to automate Scrapy and write its output to Excel-readable files (CSV). So far, the only workable approach is the tedious, manual command:
scrapy crawl myscript -o myscript.csv -t csv
I want to be able to format each of these into a more collected "row" format. Furthermore, is there any way I can make the scraper automated? Ideally, I want the code to run once per day, and I want to be able to notify myself when there has been an update regarding my scrape (an update being a relevant new post).
My spider is working, and here is the code:
import scrapy
from scrapy.spiders import XMLFeedSpider
from YahooScrape.items import YahooScrapeItem

class Spider(XMLFeedSpider):
    name = "Test"
    allowed_domains = ["yahoo.com"]
    start_urls = ('https://feeds.finance.yahoo.com/rss/2.0/headline?s=GOOGL',)
    itertag = 'item'

    def parse_node(self, response, node):
        item = {}
        item['title'] = node.xpath('title/text()').extract_first()
        item['pubDate'] = node.xpath('link/pubDate/text()').extract_first()
        item['link'] = node.xpath('link/text()').extract_first()
        item['description'] = node.xpath('description/text()').extract_first()
        return item
I am aware that to further export/organize my scraper, I have to edit the pipeline settings (at least according to a big majority of articles I have read).
Below is my pipelines.py code:
class YahooscrapePipeline(object):
    def process_item(self, item, spider):
        return item
How can I set it up so I can just execute the spider, and it will automatically write the output?
Update: I am using Scrapinghub's API, which runs off of the shub module, to host my spider. It is very convenient and easy to use.
Scrapy itself does not handle periodic execution or scheduling; it is completely out of Scrapy's scope. I'm afraid the answer will not be as simple as you want, but it is what's needed.
What you CAN do is:
Use celerybeat to allow scheduling based on a crontab schedule. Running Celery tasks periodically (without Django) and http://docs.celeryproject.org/en/latest/userguide/periodic-tasks.html should get you started.
The other thing I suggest is that you host your spider in scrapyd. That will buy you log retention and a nice json api to use when you get more advanced:).
The Stack Overflow link gives you sample code for running Celery without Django (as a lot of examples assume Django). Remember to run the beat scheduler and not the task directly -- as pointed out in the link.
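A rough sketch of such a beat schedule, assuming Celery 4 with a Redis broker and shelling out to the scrapy CLI; the spider name is taken from the question, everything else is a placeholder:
# tasks.py
import subprocess
from celery import Celery
from celery.schedules import crontab

app = Celery('crawler', broker='redis://localhost:6379/0')

app.conf.beat_schedule = {
    'daily-yahoo-crawl': {
        'task': 'tasks.run_spider',
        'schedule': crontab(hour=6, minute=0),   # once per day at 06:00
    },
}

@app.task
def run_spider():
    # shell out to the scrapy CLI, exactly as you would run it by hand
    subprocess.run(['scrapy', 'crawl', 'Test', '-o', 'myscript.csv', '-t', 'csv'], check=True)
Run celery -A tasks worker and celery -A tasks beat (or worker -B while developing), as the linked docs describe.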
As to your question about organizing the output of your spider: you mentioned that you are familiar with exporters. You can create a custom CSV item exporter and then register the fields to export in your settings; the order in which they appear in your settings is the order in which they will be written into the CSV file.
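If it is only the column order you need, Scrapy's FEED_EXPORT_FIELDS setting already does this; a minimal sketch using the field names from the spider above:
# settings.py
FEED_EXPORT_FIELDS = ['title', 'pubDate', 'link', 'description']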
If I misunderstood that part of the question and you actually mean vertical rather than horizontal alignment of your items, and you don't have many fields, a quick hack is to add a newline (\n) or tab (\t) to the field values in your spider before they are exported. It is such a hacky thing to do, though, that I'll spare you the details.
As to scheduling a spider:
As mentioned, there is Scrapyd, which I use together with scrapymon... But be warned: as of this moment Scrapyd has some compatibility issues, so please force yourself to create a virtual environment for your Scrapyd projects.
There's a huge learning curve to getting Scrapyd working the way you want it.
Using Django with Celery is by far the top solution when your scraping gets serious, but it comes with a much higher learning curve: now you have to deal with server matters, which is even more of a pain when it is not a local server, plus custom integration or alteration of a web-based GUI. If you don't want to mess with all that, what I did for a long time was use Scrapinghub: get acquainted with their API (you can curl it or use the Python modules they provide) and cron-schedule your spiders as you see fit right from your PC. The scrape is done remotely, so you keep your resource power.

How to setup a 3-tier web application project

EDIT:
I have added [MVC] and [design-patterns] tags to expand the audience for this question, as it is more of a generic programming question than something that has directly to do with Python or SQLAlchemy. It applies to all applications with business logic and an ORM.
The basic question is if it is better to keep business logic in separate modules, or to add it to the classes that our ORM provides:
We have a flask/sqlalchemy project for which we have to setup a structure to work in. There are two valid opinions on how to set things up, and before the project really starts taking off we would like to make our minds up on one of them.
If any of you could give us some insights on which of the two would make more sense and why, and what the advantages/disadvantages would be, it would be greatly appreciated.
My example is an HTML letter that needs to be sent in bulk and/or displayed to a single user. The letter can have sections that display an invoice and/or a list of articles for the user it is addressed to.
Method 1:
Split the code into 3 tiers - 1st tier: web interface, 2nd tier: processing of the letter, 3rd tier: the models from the ORM (sqlalchemy).
The website will call a server side method in a class in the 2nd tier, the 2nd tier will loop through the users that need to get this letter and it will have internal methods that generate the HTML and replace some generic fields in the letter, with information for the current user. It also has internal methods to generate an invoice or a list of articles to be placed in the letter.
In this method, the 3rd tier is only used for fetching data from the database and perhaps some database-related logic, like generating a full name from a user's first name and last name. The 2nd tier performs most of the work.
Method 2:
Split the code into the same three tiers, but only perform the loop through the collection of users in the 2nd tier.
The methods for generating HTML, invoices and lists of articles are all added as methods to the model definitions in tier 3 that the ORM provides. The 2nd tier performs the loop, but the actual functionality is enclosed in the model classes in the 3rd tier.
We concluded that both methods could work, and both have pros and cons:
Method 1:
separates business logic completely from database access
prevents importing an ORM model from also pulling in a lot of methods/functionality that we might not need, and keeps the code for the model classes more compact.
might be easier to use when mocking out ORM models for testing
Method 2:
seems to be in line with the way Django does things in Python
allows simple access to methods: when a model instance is present, any function it performs can be immediately called. (in my example: when I have a letter instance available, I can directly call a method on it that generates the HTML for that letter)
you can pass instances around, having all appropriate methods at hand.
Normally, you use the MVC pattern for this kind of stuff, but most web frameworks in Python have dropped the "Controller" part, since they believe that it is an unnecessary component. In my development I have realized that this is somewhat true: I can live without it. That would leave you with two layers: the view and the model.
The question is where to put business logic now. In a practical sense, there are two ways of doing this, or at least two situations in which I am confronted with where to put logic:
Create special internal view methods that handle logic, that might be needed in more than one view, e.g. _process_list_data
Create functions that are related to a model, but not directly tied to a single instance inside a corresponding model module, e.g. check_login.
To elaborate: I use the first one for strictly display-related methods, i.e. they are somehow concerned with processing data for display purposes. My above example, _process_list_data, lives inside a view class (which groups methods by purpose), but could also be a normal function in a module. It receives some parameters, e.g. the data list, and somehow formats it (for example it may add additional view parameters so the template can have less logic). It then returns the data set to the original view function, which can either pass it along or process it further.
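A hedged sketch of what such a view helper could look like; the sort key and the is_new flag are invented for illustration:
def _process_list_data(items, sort_key='created_at'):
    """Reorder and annotate raw rows so the template needs almost no logic."""
    rows = sorted(items, key=lambda item: getattr(item, sort_key), reverse=True)
    return [
        {'obj': row, 'css_class': 'highlight' if row.is_new else ''}
        for row in rows
    ]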
The second one is used for most other logic which I like to keep out of my direct view code for easier testing. My example of check_login does this: it is a function that is not directly tied to display output, as its purpose is to check the user's login credentials and decide to either return a user or report a login failure (by throwing an exception, returning False, or returning None). However, this functionality is not directly tied to a model either, so it cannot live inside an ORM class (well, it could be a staticmethod on the User object). Instead it is just a function inside a module (remember, this is Python: you should use the simplest approach available, and functions are there for something).
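For instance, check_login could live in a plain module next to the models; a minimal sketch assuming Flask-SQLAlchemy and password hashes stored via werkzeug (the module and model names are hypothetical):
# auth.py -- module-level logic that is neither a view nor an ORM class
from werkzeug.security import check_password_hash
from myapp.models import User   # hypothetical SQLAlchemy model

def check_login(username, password):
    """Return the matching User, or None if the credentials are wrong."""
    user = User.query.filter_by(username=username).first()
    if user is not None and check_password_hash(user.password_hash, password):
        return user
    return None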
To sum this up: display logic goes in the view, and all the other stuff goes in the model, since most logic is somehow tied to specific models. And if it is not, create a new module or package just for logic of this kind. For example, I often create a util module/package for helper functions that are not directly tied to any view, model or anything else, for example a function to format dates that is called from the template but contains so much Python code that it would be ugly if defined inside a template.
Now we bring this logic to your task: Processing/Creation of letters. Since I don't know exactly what processing needs to be done, I can only give general recommendations based on my assumptions.
Let's say you have some data and want to bring it into a letter. For example, you have a list of articles and a customer who bought these articles. In that case, you already have the data. The only thing that may need to be done before passing it to the template is reformatting it in such a way that the template can easily use it. For example, it may be desirable to order the purchased articles, say by amount, price or article number. This is something that is independent of the model; the order is now only display-related (you could have specified the order already in your database query, but let's assume you didn't). In this case, this is an operation your view would do, so your template has the data ready-formatted to be displayed.
Now let's say you want to get the data to create a specific letter, for example a list of articles the user bought over time, together with the date when they were bought and other details. This would be the model's job, e.g. create a query, fetch the data and make sure it has all the properties required for this specific task.
Let's say in both cases you wish to retrieve a price for the product, and that price is determined by a base value and some percentages based on other properties: this would make sense as a model method, as it operates on a single product or order instance. You would then pass the model to the template and call the price method inside it. But you might as well reformat it in such a way that the call is made already in the view and the template only gets tuples or dictionaries. This would make it easier to pass the same data out as an API (see below), but it might not necessarily be the easiest/best way.
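A small sketch of such a price method, assuming a Flask-SQLAlchemy model; the column names and the percentage rule are made up for illustration:
class Product(db.Model):               # db = flask_sqlalchemy.SQLAlchemy(app)
    id = db.Column(db.Integer, primary_key=True)
    base_price = db.Column(db.Numeric, nullable=False)
    discount_pct = db.Column(db.Numeric, default=0)

    def price(self):
        # base value adjusted by a percentage, as described above
        return self.base_price * (1 - self.discount_pct / 100)
The template (or the view, if you prefer to hand the template tuples or dictionaries) then simply calls product.price().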
A good rule for this decision is to ask yourself: If I were to provide a JSON API in addition to my standard view, how would I need to modify my code to be as DRY as possible? If that theoretical question is not enough at the start, build some APIs alongside the templates and see where you need to change things so the API makes sense next to the views themselves. You may never use this API, so it does not need to be perfect, but it can help you figure out how to structure your code. However, as you saw above, this doesn't necessarily mean that you should preprocess the data in such a way that you only return things that can be turned into JSON; instead, you might want to do some JSON-specific formatting for the API view.
So I went on a little longer than I intended, but I wanted to provide some examples to you because that is what I missed when I started and found out those things via trial and error.

Dynamically add URL rules to Flask app

I am writing an app in which users will be able to store information that they can specify a REST interface for. I.e., store a list of products at /<username>/rest/products. Since the URLs are obviously not known beforehand, I was trying to think of the best way to implement dynamic URL creation in Flask. The first way I thought of would be to write a catch-all rule and route the URL from there. But then I am basically duplicating URL routing capabilities when Flask already has them built in. So, I was wondering if it would be a bad idea to use .add_url_rule() (docs here, scroll down a bit) to attach them directly to the app. Is there a specific reason this shouldn't be done?
Every time you execute add_url_rule() the internal routing remaps the URL map. This is neither thread-safe nor fast. Right now, to be honest, I don't understand why you need user-specific URL rules. It kind of sounds like you actually want user-specific applications mounted?
Maybe this is helpful: http://flask.pocoo.org/docs/patterns/appdispatch/
I have had a similar requirement for my application, where each endpoint /<SOMEID>/rest/other for a given SOMEID should be bound to a different function. One way to achieve this is to keep a lookup dictionary where the values are the functions that handle the specific SOMEID. For example, take a look at this snippet:
from flask import Flask, jsonify

app = Flask(__name__)

func_look_up_dict = {...}  # mapping of SOMEID -> handler function

@app.route('/<SOMEID>/rest/other', methods=['GET'])
def multiple_func_router_endpoint(SOMEID):
    if SOMEID in func_look_up_dict:
        return jsonify({'result': func_look_up_dict[SOMEID]()}), 200
    else:
        return jsonify({'result': 'unknown', 'reason': 'invalid id in url'}), 404
So for this case you don't really need to "dynamically" add URL rules, but rather use a URL rule with a parameter and handle the various cases within a single function. Another thing to consider is to really think about the use case of such a URL endpoint. If <username> is a parameter that needs to be passed in, why not use a URL rule such as /rest/product/<username>, or pass it in as an argument in the GET request?
Hope that helps.

How to implement Google-style pagination on app engine?

See the pagination on the app gallery? It has page numbers and a 'start' parameter which increases with the page number. Presumably this app was made on GAE. If so, how did they do this type of pagination? ATM I'm using cursors but passing them around in URLs is as ugly as hell.
You can simply pass in the 'start' parameter as an offset to the .fetch() call on your query. This gets less efficient as people dive deeper into the results, but if you don't expect people to browse past 1000 or so, it's manageable. You may also want to consider keeping a cache, mapping queries and offsets to cursors, so that repeated queries can fetch the next set of results efficiently.
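A minimal sketch of the offset approach with the old db API; Greeting and PAGE_SIZE are placeholders:
from google.appengine.ext import db

PAGE_SIZE = 20

def get_page(start):
    # 'start' comes straight from the request, e.g. ?start=40
    query = Greeting.all().order('-date')        # Greeting is a placeholder model
    return query.fetch(PAGE_SIZE, offset=start)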
Ben Davies's outstanding PagedQuery class will do everything that you want and more.
