I have implemented a distributed web crawler on RabbitMQ. Almost everything is done except the visited-URL set. I want some kind of shared variable between the different crawlers.
Furthermore, from what I have been reading, this URL set will be huge and should be stored on disk.
What is the best way to store, access and share this visited-URL list in a distributed environment?
As majidkabir says, Nutch is quite a good solution...but that doesn't answer the question since it's about how to track state when building your own crawler.
I'll offer the approach I took when I created a crawler in Node (https://www.npmjs.com/package/node-nutch). As you can see from the name, the approach I've taken is in turn modelled on the approach taken in Nutch.
All I did was use the normalised URL as the key and store a simple JSON file in S3 containing the state of the crawl (a minimal sketch of this pattern follows this answer). When it was time to run the next crawl, I'd whizz through each of these JSON files looking for candidates to be crawled, and after retrieving each page, update the JSON to indicate when to crawl it next.
The number of pages I was crawling was never very large, so this worked fine, but if it grew I'd put the JSON into something like Elasticsearch and then search for URLs to be crawled based on a date field.
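A minimal sketch of that per-URL state pattern, translated to Python with boto3 (the package above is Node); the bucket name, key scheme and next_fetch field are illustrative assumptions, not the package's actual layout:

import hashlib
import json

import boto3

s3 = boto3.client("s3")
BUCKET = "my-crawl-state"  # hypothetical bucket name

def state_key(url):
    # Hash the normalised URL so every page gets a flat, fixed-length S3 key.
    return "state/" + hashlib.sha256(url.encode("utf-8")).hexdigest() + ".json"

def load_state(url):
    # Return the stored crawl state for a URL, or None if it has never been seen.
    try:
        obj = s3.get_object(Bucket=BUCKET, Key=state_key(url))
        return json.loads(obj["Body"].read())
    except s3.exceptions.NoSuchKey:
        return None

def save_state(url, next_fetch_iso):
    # Record when this URL should be fetched next.
    state = {"url": url, "next_fetch": next_fetch_iso}
    s3.put_object(Bucket=BUCKET, Key=state_key(url), Body=json.dumps(state))

The next crawl would then list the keys under state/ and pick the entries whose next_fetch is in the past.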
Ideally, any storage that is scalable and supports indexing can be used for this kind of use case.
Some of the systems I know are used for this purpose are Solr, Elasticsearch, Redis, or any SQL database that can scale.
I have used Redis for the same purpose, storing approximately 2 million URLs. I'm quite sure that by adding nodes I could scale it easily.
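For illustration, a minimal sketch of the Redis variant (not the exact setup described above); the key name and connection details are placeholders:

import redis

# Assumes a reachable Redis instance; host/port/db are placeholders.
r = redis.Redis(host="localhost", port=6379, db=0)

def seen_before(url):
    # SADD returns 1 if the URL was newly added and 0 if it was already present,
    # so a single call both records the URL and tells us whether to crawl it.
    return r.sadd("visited_urls", url) == 0

url = "https://example.com/page"
if seen_before(url):
    print("already visited, skip")
else:
    print("new URL, crawl it")

Because SADD is atomic, every crawler worker can share the same set without races.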
You can use Apache Nutch for crawling. It can re-crawl URLs on a schedule and uses adaptive algorithms for this purpose.
For example, when a page hasn't changed since the previous crawl, Nutch increases the interval before the next crawl of that URL; if it has changed, it decreases the interval (sketched below).
You can write your own Nutch plugin to parse the data Nutch has crawled, or use the predefined plugins.
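This is not Nutch's actual code, just a toy sketch of the adaptive re-fetch idea described above; the bounds and factor are made up:

def next_interval(current_days, page_changed, min_days=1, max_days=30, factor=1.5):
    # Back off when a page is stable, re-visit sooner when it keeps changing.
    if page_changed:
        new_interval = current_days / factor
    else:
        new_interval = current_days * factor
    return max(min_days, min(max_days, new_interval))

# A page that stops changing drifts toward the 30-day cap over successive crawls.
interval = 7
for changed in (False, False, True):
    interval = next_interval(interval, changed)
    print(round(interval, 1))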
First I guess I should say I am still a bit of a Django/Python noob. I am in the midst of a project that allows users to enter a URL; the site scrapes the content from that page and returns images over a certain size plus the page's title tag, so the user can then pick which image they want to use on their profile. A pretty standard scenario, I assume. I have this working by using Selenium (headless Chrome) to grab the destination page content, some Python to determine the file sizes, and then my Django view spits it all out into a template. I then have it coded so that the image the user selects is downloaded and stored locally.
However, I seriously doubt the scalability of this. It's currently just running locally, and I am very concerned about how it would cope with lots of users making requests at the same time. I am firing up that headless Chrome browser every time a request is made, which doesn't sound efficient, and I have to download each image to determine its size so I can decide whether it's large enough. One example took 12 seconds from submitting the URL to displaying the results to the user, whereas the same destination URL put through www.kit.com (which has very similar scraping functionality) took 3 seconds.
I have not provided any code because the code I have does what it should; I think the approach, however, is incorrect. To summarise, what I want is:
To allow a user to enter a URL and for it to return all images (or just the URLs to those images) from that page over a certain size (width/height), and the page title.
For this to be the most efficient solution, taking into account it would be run concurrently between many users at once.
For it to work in a Django (2.0) / Python (3+) environment.
I am not completely against using the API from a 3rd party service if one exists, but it would be my least preferred option.
Any help/pointers would be much appreciated.
You can use two Python solutions in your case:
1) BeautifulSoup. There is a good answer describing how to download images using it; you just have to make it a separate function and pass the site as an argument. It is also very easy to parse only the image links, as you said, depending on the speed you need (obviously scraping the files themselves, especially when there are a lot of them, will be much slower than just collecting links). This tool is just for parsing and scraping the content of a page (a sketch follows this list).
2) Scrapy. This is a much more powerful tool, a full framework; with it you can connect your spiders to Django models and handle images much more efficiently using its built-in image pipeline. It is much more flexible, with a lot of features for working with scraped data. I am not sure whether you need it in your project or whether it would be overkill in your case.
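Here is a minimal sketch of the BeautifulSoup route from 1), collecting the page title and image URLs without downloading the images; filtering on the width/height attributes is an assumption, since the true dimensions are only known once an image is actually fetched:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def extract_images_and_title(url, min_width=200, min_height=200):
    # Fetch the page once and parse it; no Selenium needed for static HTML.
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    title = soup.title.string.strip() if soup.title and soup.title.string else ""

    images = []
    for img in soup.find_all("img", src=True):
        src = urljoin(url, img["src"])
        # Only trust width/height attributes when present; otherwise the image
        # would have to be fetched to know its real size.
        try:
            width = int(img.get("width", 0))
            height = int(img.get("height", 0))
        except ValueError:
            width = height = 0
        if width >= min_width and height >= min_height:
            images.append(src)
    return {"title": title, "images": images}

print(extract_images_and_title("https://example.com"))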
Also, my advice is to run the scraping in a background task, for example via a queue such as Celery, and fetch the result via AJAX, because it may take some time to parse the content; don't make the user wait for the response (a minimal task sketch follows the P.S. below).
P.S. You can even combine those two tools in some cases :)
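And a minimal sketch of the background-task advice above, assuming Celery is already wired into the Django project; extract_images_and_title is the hypothetical helper from the previous sketch:

# tasks.py (hypothetical module name)
from celery import shared_task

from myapp.scraping import extract_images_and_title  # hypothetical helper from the sketch above

@shared_task
def scrape_page(url):
    # Runs in a worker process so the Django view can return immediately;
    # the front end polls for the result (e.g. via AJAX) using the task id.
    return extract_images_and_title(url)

# In a Django view:
#   task = scrape_page.delay(url)
#   return JsonResponse({"task_id": task.id})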
I have deployed some Scrapy spiders to scrape data, which I can download as .csv from ScrapingHub.
Some of these spiders have a FilesPipeline which I use to download files (PDFs) to a specific folder. Is there any way I can retrieve these files from ScrapingHub via the platform or API?
Though I'd have to go back over ScrapingHub's documentation to be certain, I'm quite sure that despite the file explorer in the UI there are no actual files being generated, or they are ignored during the crawl. You can see a hint of this in how deployment works: if you try to deploy anything beyond the files that make up a standard Scrapy project, ScrapingHub won't accept it unless you do some hacking around with your settings and setup file to get the extra pieces accepted. For example, if you keep a huge list of start URLs in a file and read them into your spider with a helper function, it works like a charm locally, but ScrapingHub wasn't built with that in mind.
I assume you already know that you can download your items as CSV (or another format) straight from the web interface. Personally, I use the ScrapingHub client API in Python. The libraries involved are, I believe, partly deprecated at this point, so you have to mix and match a bit to get everything fully functional.
As a side gig I do content aggregation for a fairly well-known adult site. Using ScrapingHub's API client for Python I can connect to my account with the API key and do pretty much whatever I need. There are some limitations, though; the one that really bothers me is that the function for getting a project's name was deprecated in the first version of the client library. I'd like to see the project name when I'm parsing items from the different jobs (crawls) a spider runs; without it, things looked messy when I first started working with the client.
What is convenient is that once you create a project, run your spider and have all your items collected, you can download those items directly from the web interface as mentioned, and you can also target the output to get exactly the effect you want. For example:
When I crawl a site for media items such as videos, there are three things I always need: the name or title of the video, the source URL where the video can be reached (or where it is embedded), which you can then request whenever you need it, and the metadata, i.e. the tags and categories associated with the video.
The largest crawl so far output around 150,000 items; it was a broad crawl and something like 15 to 17% of the items were duplicates. I then access each item through the API client by its keys (not a dictionary, by the way). In my case I always use all three fields, but I can also filter on the categories or tags and output only the items (still with all three fields) that match a particular string or expression, which lets me slice through my content quite effectively. In this particular Scrapy project I simply print the results out as an .m3u playlist. A minimal sketch of reading items through the client follows this answer.
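A minimal sketch of reading a job's items with the python-scrapinghub client (2.x); the API key, project id and item field names are placeholders:

from scrapinghub import ScrapinghubClient

client = ScrapinghubClient("YOUR_API_KEY")   # placeholder API key
project = client.get_project(123456)         # placeholder project id

# Iterate over finished jobs and their scraped items.
for job_summary in project.jobs.iter(state="finished"):
    job = client.get_job(job_summary["key"])
    for item in job.items.iter():
        # Field names depend on your spider; these are examples.
        if "music" in item.get("categories", []):
            print(item.get("title"), item.get("url"))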
I need to scrape about 40 random webpages at the same time. These pages vary on each request.
I have used RPCs in Python to fetch the URLs and scraped the data using BeautifulSoup. It takes about 25 seconds to scrape all the data and display it on the screen.
To increase speed, I stored the data in the App Engine datastore so that each page is scraped only once and can be accessed quickly from there.
But the problem is that as the size of the data in the datastore increases, fetching it from the datastore takes too long (longer than the scraping).
Should I use memcache, or shift to MySQL? Is MySQL faster than the GAE datastore?
Or is there any other, better way to fetch the data as quickly as possible?
Based on what I know about your app it would make sense to use memcache. It will be faster, and will automatically take care of things like expiring stale cache entries.
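A minimal sketch of that pattern on the (Python 2) App Engine standard environment; the key scheme, one-hour expiry and scrape_page helper are assumptions:

from google.appengine.api import memcache

def get_page_data(url):
    # Try the cache first; entries expire automatically after the given time.
    data = memcache.get("page:" + url)
    if data is not None:
        return data

    data = scrape_page(url)  # hypothetical existing scraping function
    memcache.set("page:" + url, data, time=3600)  # cache for one hour
    return data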
I have a list of university URLs like www.harvard.edu, www.berkeley.edu, etc.
I need to find the CSE department URLs on the respective websites. What I initially set out to do was to crawl through the links on the given URL and, by specifying a depth (say 3), follow links and look for "cse", "computer", or a list of such words in the links scraped on each page; matching links along with their anchor text are returned as results into a CSV file.
If there are no links containing "cse" or similar words, it should return "not found" or something like that.
The idea is to push the CSV file onto a database later. How can I do this?
This is quite a complex task, and I recommend using a database with a structure like this:
CREATE TABLE pages (
  `absolute_url` VARCHAR(255) NOT NULL,
  `visited` TINYINT(1) DEFAULT 0,
  -- Additional fields
  UNIQUE KEY (`absolute_url`)
);
A little explanation:
absolute_url contains the full URL of a page (starting with http[s]://) and has a unique index on it. This way you make sure you won't end up in recursion or process the same link twice.
visited tells you whether a page has already been visited (and processed). This field prevents duplicate visits and lets you recover gracefully if your program crashes (e.g. during network downtime).
You could implement these things yourself with CSV files or associative arrays, but a database is the most familiar solution for me.
And the algorithm would go like this:
database.insert('http://www.harvard.edu')
database.insert('http://www.berkeley.edu')

# In case of failure you'll start at this point:
while database.get_count(WHERE visited = 0) > 0:
    for url in database.get_records(WHERE visited = 0):
        content = http_client.load(url)
        time.sleep(5)  # You don't want to flood the server

        # Problematic URLs will be retried in a later pass
        if (not content) or http_client.is_error:
            continue

        for i in content.get_all_urls():
            i = make_absolute(i, url)
            # Also don't crawl remote sites, images, ...
            if not is_valid_url(i):
                continue
            database.insert(i)

        # Mark the page as processed so it isn't fetched again
        database.set_visited(url)
This is pseudocode, I won't implement it all for you.
To solve your problem you can use the Scrapy framework.
Extracted from the Scrapy website:
Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
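Not part of the quoted docs, just a minimal sketch of how the keyword-matching crawl from the question might look in Scrapy; the spider name, keyword list and depth limit are assumptions:

import scrapy
from scrapy.linkextractors import LinkExtractor

class DeptLinksSpider(scrapy.Spider):
    name = "dept_links"                                        # hypothetical name
    allowed_domains = ["harvard.edu", "berkeley.edu"]          # keep the crawl on the university sites
    start_urls = ["https://www.harvard.edu", "https://www.berkeley.edu"]
    custom_settings = {"DEPTH_LIMIT": 3}                       # the depth from the question
    keywords = ("cse", "computer")                             # the words to look for

    def parse(self, response):
        for link in LinkExtractor().extract_links(response):
            text = (link.text or "").strip()
            haystack = (link.url + " " + text).lower()
            if any(word in haystack for word in self.keywords):
                # Matching links and their anchor text end up in the output
                yield {"url": link.url, "anchor_text": text}
            # Keep following links up to DEPTH_LIMIT
            yield response.follow(link.url, callback=self.parse)

Running it with something like scrapy runspider dept_links.py -o links.csv produces the CSV of matching links and anchor text that the question asks for.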
I am creating an aggregator and I started with scrapy as my initial tool set.
At first I only had a few spiders, but as the project grows it seems I may end up with hundreds or even a thousand different spiders as I scrape more and more sites.
What is the best way to manage these spiders, given that some websites only need to be crawled once and others on a more regular basis?
Is Scrapy still a good tool when dealing with so many sites, or would you recommend some other technology?
You can check out the project scrapely, which is from the creators of Scrapy. But as far as I know, it is not suitable for sites that rely on JavaScript; more precisely, it only works when the parsed data is not generated by JavaScript.