I am thinking Web Crawler but how to start? - python

I work at a company that deals with phishing and fake Facebook accounts, and I want to show my dedication to the "mission". We currently have no way to passively monitor Facebook pages to find out when they are removed. I am thinking of a web crawler, but I am curious how to design one that constantly checks a specific link to see whether the Facebook page is still active. I hope this makes sense.

Yes, you can use crawling. However, if you want it to be as fast as possible, crawling may not be the best way to do it. If you're interested, this is how I'd do it using HTTPConnection; unfortunately, this only detects a link that is completely broken.
If you need more information, you will most likely have to use an API or a web crawler to check whether the link is broken (meaning it links to nowhere):
from http.client import HTTPConnection  # HTTPConnection lives in http.client in Python 3.
conn = HTTPConnection('www.google.com')  # Open a connection to the host.
conn.request('HEAD', '/index.html')  # HEAD fetches only the status line and headers, not the body.
res = conn.getresponse()  # Read the response that was sent back.
print(res.status, res.reason)  # Print the status code and reason phrase.
If it returns 200 OK, the page is active; a 404 (or a redirect to an error page) suggests it has been removed.
I hope this helps! Please tell me if this isn't what you wanted. :)
Thanks,
~Coolq

You can send an HTTP request and tell whether the account is active from the response status. Python has standard libraries for this; have a look at Internet Protocols and Support.
Personally, I recommend using requests:
import requests
# Don't follow redirects automatically, so a 302 shows up in the status code.
response = requests.get("http://facebook.com/account", allow_redirects=False)
if response.status_code in (302, 404):
    # the page is missing (redirected away or not found)
    print("the page is missing")
If you really care about speed or performance, you should use multiprocessing or asynchronous I/O (like gevent) in Python; a small sketch follows.
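As a rough illustration of the concurrent approach, here is a sketch using the standard library's concurrent.futures instead of gevent; the URLs and worker count are placeholders.

import requests
from concurrent.futures import ThreadPoolExecutor

# Example URLs to check; replace with the pages you actually monitor.
URLS = [
    "https://www.facebook.com/somepage",
    "https://www.facebook.com/anotherpage",
]

def check(url):
    # Disable redirects so a redirect to a login/error page is visible as a 3xx status.
    resp = requests.get(url, allow_redirects=False, timeout=10)
    return url, resp.status_code

with ThreadPoolExecutor(max_workers=10) as pool:
    for url, status in pool.map(check, URLS):
        print(status, url)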
If your focus is on crawling, you may have a look at Scrapy:
Here you notice one of the main advantages about Scrapy: requests are
scheduled and processed asynchronously. This means that Scrapy doesn’t
need to wait for a request to be finished and processed, it can send
another request or do other things in the meantime. This also means
that other requests can keep going even if some request fails or an
error happens while handling it.

https://www.quora.com/How-can-I-build-a-web-crawler-from-scratch/answer/Raghavendran-Balu
One of the best articles I have read about Crawlers.
A web crawler might sound like a simple fetch-parse-append system, but watch out! You may overlook the complexity. I might deviate from the question's intent by focusing more on architecture than implementation specifics. I believe this is necessary because, to build a web-scale crawler, the architecture of the crawler is more important than the choice of language/framework.
Architecture:
A bare minimum crawler needs at least these components (a toy sketch follows the list):
HTTP Fetcher: to retrieve web pages from the server.
Extractor: minimal support to extract URLs from a page, e.g. anchor links.
Duplicate Eliminator: to make sure the same content is not extracted twice unintentionally. Think of it as a set-based data structure.
URL Frontier: to prioritize URLs that have to be fetched and parsed. Think of it as a priority queue.
Datastore: to store retrieved pages, URLs, and other metadata.
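As a hedged, toy sketch of how these components fit together in a single process (the seed URL, the page limit, and the regex-based extractor are illustrative simplifications, not production choices):

import re
import urllib.parse
import urllib.request
from collections import deque

seed = "https://example.com/"   # placeholder seed URL

frontier = deque([seed])        # URL frontier (a plain FIFO here, not a priority queue)
seen = {seed}                   # duplicate eliminator (set-based)
datastore = {}                  # datastore: URL -> raw HTML

while frontier and len(datastore) < 50:
    url = frontier.popleft()
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:   # HTTP fetcher
            html = resp.read().decode("utf-8", errors="replace")
    except Exception:
        continue
    datastore[url] = html

    # Extractor: pull anchor hrefs (a real crawler would use an HTML parser).
    for href in re.findall(r'href="([^"]+)"', html):
        link = urllib.parse.urljoin(url, href)
        if link not in seen:
            seen.add(link)
            frontier.append(link)

print("Crawled", len(datastore), "pages")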
A good starting point to learn about architecture is:
Web Crawling
Crawling the Web
Mercator: A scalable, extensible Web crawler
UbiCrawler: a scalable fully distributed web crawler
IRLbot: Scaling to 6 billion pages and beyond (single-server crawler)
MultiCrawler: a pipelined architecture
Programming Language: Any high-level language with a good network library that you are comfortable with is fine. I personally prefer Python/Java. As your crawler project grows in code size, it will be hard to manage if you develop it in a design-restricted programming language. While it is possible to build a crawler using just Unix commands and shell scripts, you might not want to do so for obvious reasons.
Framework/Libraries: Many frameworks are already suggested in other answers. I shall summarise here:
Apache Nutch and Heritrix (Java): Mature, Large scale, configurable
Scrapy (Python): Technically a scraper but can be used to build a crawler.
You can also visit https://github.com/scrapinghub/distributed-frontera - URL frontier and data storage for Scrapy, allowing you to run large scale crawls.
node.io (Javascript): Scraper. Nascent, but worth considering, if you are ready to live with javascript.
For Python: Refer Introduction to web-crawling in Python
Code in Python: https://www.quora.com/How-can-I-build-a-web-crawler-from-scratch/answer/Rishi-Giri-1
Suggestions for scalable distributed crawling:
It is better to go for an asynchronous model, given the nature of the problem.
Choose a distributed database for data storage, e.g. HBase.
A distributed data store like Redis is also worth considering for the URL frontier and duplicate detector (see the sketch after this list).
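A minimal sketch of a Redis-backed URL frontier and duplicate detector, assuming a local Redis instance and made-up key names ("frontier", "seen"):

import redis

r = redis.Redis(host="localhost", port=6379)

def enqueue(url, priority=0.0):
    # SADD returns 1 only the first time a member is added, so it doubles
    # as the duplicate check; ZADD keeps the frontier ordered by priority.
    if r.sadd("seen", url):
        r.zadd("frontier", {url: priority})

def dequeue():
    # Pop the lowest-score (highest-priority) URL, if any remain.
    popped = r.zpopmin("frontier", 1)
    return popped[0][0].decode() if popped else None

enqueue("https://example.com/", priority=0.0)
print(dequeue())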
For more information, visit: https://www.quora.com/How-can-I-build-a-web-crawler-from-scratch
References:
Olston, C., & Najork, M. (2010). Web crawling. Foundations and Trends in Information Retrieval, 4(3), 175-246.
Pant, G., Srinivasan, P., & Menczer, F. (2004). Crawling the web. In Web Dynamics (pp. 153-177). Springer Berlin Heidelberg.
Heydon, A., & Najork, M. (1999). Mercator: A scalable, extensible web crawler. World Wide Web, 2(4), 219-229.
Boldi, P., Codenotti, B., Santini, M., & Vigna, S. (2004). Ubicrawler: A scalable fully distributed web crawler. Software: Practice and Experience, 34(8), 711-726.
Lee, H. T., Leonard, D., Wang, X., & Loguinov, D. (2009). IRLbot: scaling to 6 billion pages and beyond. ACM Transactions on the Web (TWEB), 3(3), 8.
Harth, A., Umbrich, J., & Decker, S. (2006). Multicrawler: A pipelined architecture for crawling and indexing semantic web data. In The Semantic Web-ISWC 2006 (pp. 258-271). Springer Berlin Heidelberg.

Related

Can websites detect web scraping if I act like a human (Selenium, Python)?

I use Selenium in Python and I want to scrape many hundreds of pages from one company's website. My scraping shouldn't burden their system under any circumstances, and because it is a very large website anyway, it shouldn't be a problem for them.
My question is whether the company can somehow detect that I'm scraping if I act like a human, i.e. I stay on each page for an extra long time and let extra time pass between requests.
I don't think they can recognize me by my IP, because I spread the requests over a very long period of time, so I think it looks like normal traffic.
Are there any other ways that websites can tell that I am scraping or, generally, running a script?
Many thanks.
(P.S.: I know that a similar question has already been asked, but the answer there was simply that the asker didn't behave like a human and visited the website too quickly. It's different for me...)
When you are scraping, make sure you respect the robots.txt file located at the root of the website. It sets the rules for crawling: which parts of the website should not be scraped, and how frequently it may be scraped.
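For example, a minimal sketch using the standard library's urllib.robotparser; the site, URL, and user-agent string below are placeholders.

from urllib import robotparser

# Load and parse the site's robots.txt before fetching anything.
rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

url = "https://www.example.com/some/page"
if rp.can_fetch("MyScraperBot", url):
    print("Allowed to fetch", url)
else:
    print("robots.txt disallows", url)

# Crawl-delay, if the site specifies one (None otherwise).
print(rp.crawl_delay("MyScraperBot"))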
User navigation patterns are monitored by large companies to detect bots and scraping attempts. There are many anti-scraping tools on the market that use AI to analyse these patterns and distinguish a human from a bot.
Some of the main techniques used to prevent scraping, apart from such software, are:
CAPTCHAs,
honeypot traps,
User-Agent monitoring,
IP monitoring,
JavaScript encryption, etc.
There are many more, so what I am saying is: yes, it can be detected.
One way they can tell is from your browser headers.
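For illustration, the default User-Agent sent by python-requests identifies the script immediately; the browser-like value below is just an example string, and sites can inspect far more than this one header.

import requests

# A plain requests call announces itself as "python-requests/x.y.z"
# in the User-Agent header, which is an easy giveaway.
resp = requests.get("https://httpbin.org/headers")
print(resp.json()["headers"]["User-Agent"])

# Supplying a browser-like header (example value only) changes what the
# server sees, though many sites check much more than the User-Agent.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
resp = requests.get("https://httpbin.org/headers", headers=headers)
print(resp.json()["headers"]["User-Agent"])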

Web scraping CNN data

I have a question: does CNN permit you to scrape data if it's for your own personal use? For instance, if I wanted to write a quick program that would scrape the price of a certain stock, can I scrape CNN Money?
I've just started learning Python, so I apologize if this is a stupid question.
Obligatory I am not a lawyer.
CNN's terms of use page states that
You may not modify, publish, transmit, participate in the transfer or
sale, create derivative works, or in any way exploit, any of the
content, in whole or in part.
You may download copyrighted material
for your personal use only
So it looks like you would be fine if you do it for personal use only and don't share any of the results of the work.
However, some sites block scrapers automatically if they issue too many requests, so be sure to rate-limit your scraping and don't request too many pages.
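A minimal sketch of polite, rate-limited fetching; the URLs and the 5-second delay are arbitrary examples.

import time
import requests

# Illustrative list of pages to fetch; the sleep keeps the request rate polite.
urls = [
    "https://money.cnn.com/quote/quote.html?symb=AAPL",
    "https://money.cnn.com/quote/quote.html?symb=MSFT",
]

for url in urls:
    resp = requests.get(url, timeout=10)
    print(url, resp.status_code)
    time.sleep(5)  # wait a few seconds between requests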

Can I scrape all URL results using Python from a google search without getting blocked?

I realize that versions of this question have been asked and I spent several hours the other day trying a number of strategies.
What I would like to do is use Python to scrape all of the URLs from a Google search, which I can then use in a separate script to do text analysis of a large corpus (news sites mainly). This seems relatively straightforward, but none of the attempts I've tried have worked properly.
This is as close as I got:
from google import search
for url in search('site:cbc.ca "kinder morgan" and "trans mountain" and protest*', stop=100):
    print(url)
This returned about 300 URLs before I got kicked. An actual search using these parameters provides about 1000 results and I'd like all of them.
First: is this possible? Second: does anyone have any suggestions to do this? I basically just want a txt file of all the URLs that I can use in another script.
It seems that this package uses screen scraping to retrieve search results from Google, so it doesn't play well with Google's Terms of Service, which could be the reason why you've been blocked.
The relevant clause in Google's Terms of Service:
Don’t misuse our Services. For example, don’t interfere with our Services or try to access them using a method other than the interface and the instructions that we provide. You may use our Services only as permitted by law, including applicable export and re-export control laws and regulations. We may suspend or stop providing our Services to you if you do not comply with our terms or policies or if we are investigating suspected misconduct.
I haven't been able to find a definite number, but it seems their limit on the number of search queries per day is rather strict too: 100 search queries per day, according to their JSON Custom Search API documentation.
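If you go the official route, a request to the Custom Search JSON API looks roughly like the sketch below; the API key and search engine ID (cx) are placeholders you would have to create yourself.

import requests

API_KEY = "YOUR_API_KEY"            # placeholder
CX = "YOUR_SEARCH_ENGINE_ID"        # placeholder

params = {
    "key": API_KEY,
    "cx": CX,
    "q": 'site:cbc.ca "kinder morgan" "trans mountain" protest',
    "num": 10,    # maximum results per request
    "start": 1,   # pagination offset
}
resp = requests.get("https://www.googleapis.com/customsearch/v1", params=params)
for item in resp.json().get("items", []):
    print(item["link"])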
Nonetheless, there's no harm trying out other alternatives to see if they work better:
BeautifulSoup
Scrapy
ParseHub - this one is not code-based, but it is a useful piece of software with good documentation, including a tutorial on how to scrape a list of URLs.

Python's BeautifulSoup as a Web Application?

I wrote a fairly complex script using urllib and BeautifulSoup, and last night I wondered if there was any way to produce the same results as a web application.
I'm not asking for a tutorial, but can someone point me in the general direction of what proficiencies would be needed to write an application that lets someone input scraping criteria and a URL, and outputs the correct source, all in a webpage?
For a basic one-page web application, I'd recommend integrating your existing code into one of the available Python web micro-frameworks. Try Flask to start; this framework is lightweight and seems ideal for your use case (other options are Bottle, or Pyramid and Django for larger apps). The tutorials for these frameworks should be enough to get you on the right track.
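As a hedged sketch of what that integration could look like (the single route, the form fields, and the CSS-selector-based scraping are assumptions, not a prescribed design):

from flask import Flask, request
import requests
from bs4 import BeautifulSoup

app = Flask(__name__)

# A simple form that takes a URL and a CSS selector as the scraping criteria.
FORM = """
<form method="post">
  URL: <input name="url" size="60"><br>
  CSS selector: <input name="selector"><br>
  <input type="submit" value="Scrape">
</form>
"""

@app.route("/", methods=["GET", "POST"])
def scrape():
    if request.method == "POST":
        # Fetch the page, parse it, and return whatever matches the selector.
        html = requests.get(request.form["url"], timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        matches = soup.select(request.form["selector"])
        return "<br>".join(str(m) for m in matches) or "No matches."
    return FORM

if __name__ == "__main__":
    app.run(debug=True)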

Scraping Ajax - Using python

I'm trying to scrape a page on YouTube with Python which has a lot of AJAX in it.
I have to call the JavaScript each time to get the info, but I'm not really sure how to go about it. I'm using the urllib2 module to open URLs. Any help would be appreciated.
YouTube (and everything else Google makes) has extensive APIs already in place for giving you access to just about any and all data you could possibly want.
Take a look at the YouTube Data API for more information.
I use urllib to make the API requests and ElementTree to parse the returned XML.
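For illustration, here is the general shape of that urllib + ElementTree workflow; the feed URL is a placeholder, since the XML-based YouTube GData API this answer refers to has long been retired.

import urllib.request
import xml.etree.ElementTree as ET

url = "https://example.com/feed.xml"   # placeholder for an XML API endpoint

with urllib.request.urlopen(url) as resp:
    root = ET.parse(resp).getroot()

# Print the title of each Atom <entry> element, assuming an Atom feed.
ns = {"atom": "http://www.w3.org/2005/Atom"}
for entry in root.findall("atom:entry", ns):
    title = entry.find("atom:title", ns)
    print(title.text if title is not None else "(untitled)")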
The main problem is that you're violating the TOS (terms of service) of the YouTube site. YouTube's engineers and lawyers will do their professional best to track you down and make an example of you if you persist. If you're happy with that prospect, then on your own head be it: technically, your best bets are python-spidermonkey and Selenium. I wanted to put the technical hints on record in case anybody in the future has needs like the ones your question's title indicates, without the legal issues you clearly have if you continue in this particular endeavor.
Here is how I would do it: install Firebug on Firefox, then turn on the Net panel in Firebug and click the desired link on YouTube. Now see what happens and which pages are requested, and find the ones responsible for the AJAX part of the page. You can then use urllib or Mechanize to fetch those links. If you CAN pull the same content this way, then you have what you are looking for; just parse the content. If you CAN'T pull the content this way, that suggests the requested page might be looking at user login credentials, session info, or other header fields such as HTTP_REFERER, etc. Then you might want to look at something more extensive like Scrapy. I would suggest that you always follow the simple path first. Good luck and happy "responsible" scraping! :)
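Once the Net panel reveals the URL behind the AJAX call, fetching it directly might look roughly like this; the endpoint and header values are entirely hypothetical.

import urllib.request

req = urllib.request.Request(
    "https://www.youtube.com/some_ajax_endpoint?video_id=abc123",  # hypothetical URL found in the Net panel
    headers={
        "User-Agent": "Mozilla/5.0",                                # example header values only
        "Referer": "https://www.youtube.com/watch?v=abc123",
    },
)
with urllib.request.urlopen(req) as resp:
    payload = resp.read().decode("utf-8", errors="replace")

print(payload[:500])  # inspect the response to see if it contains the data you need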
As suggested, you should use the YouTube API to access the data made available legitimately.
Regarding the general question of scraping AJAX, you might want to consider the scrapy framework. It provides extensive support for crawling and scraping web sites and uses python-spidermonkey under the hood to access javascript links.
You could sniff the network traffic with something like Wireshark, then replay the HTTP calls via a scraping framework that is robust enough to deal with AJAX, such as Scrapy.
