Using Python 3.3 to access blocked webpages - python

I'm trying to download webpages off the internet. I'm able to grab the HTML (with urllib), but I can't download images correctly. There's already a question for that though. My question is: is there any way I can use Python to bypass a firewall to access 'blocked' webpages?
Ideally it would be using some obscure code or module, but if it's impossible, could someone tell me a good workaround using a different method (like a proxy)?

If you want to extract images from an HTML page, one option is to parse it with the re module, using a regex to pull out the src attribute of each img tag. You can also use a parser that is already written, for example BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/
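A minimal sketch of the regex approach (the page URL is a placeholder, and the regex only handles double-quoted src attributes):
import re
import urllib.parse
import urllib.request

page_url = "http://example.com/"  # placeholder
html = urllib.request.urlopen(page_url).read().decode("utf-8")

# Crude regex: grab the src attribute of every <img> tag.
img_srcs = re.findall(r'<img[^>]+src="([^"]+)"', html)

for i, src in enumerate(img_srcs):
    # Resolve relative URLs and save the raw image bytes to disk
    # (the .jpg extension is an assumption; use the real one in practice).
    absolute = urllib.parse.urljoin(page_url, src)
    urllib.request.urlretrieve(absolute, "image_%d.jpg" % i)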
A firewall is a perimeter-defense component of a computer network that sits at the contact points between two or more sections of the network and enforces its security policy. So the blocking happens at the network level, and you have to deal with it in the network itself, not in the programming language.
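That said, if there is a proxy you are allowed to use that can reach the outside, urllib can be pointed at it. A minimal sketch (the proxy address is a placeholder):
import urllib.request

# Route requests through an HTTP proxy (placeholder address).
proxy_handler = urllib.request.ProxyHandler({
    "http": "http://proxy.example.com:8080",
    "https": "http://proxy.example.com:8080",
})
opener = urllib.request.build_opener(proxy_handler)
html = opener.open("http://example.com/").read()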

Related

Python: How to query for WHOIS information using the urllib library, no CAPTCHAs

I'd like to query for WHOIS information on specific domains through either https://lookup.icann.org/ or https://mxtoolbox.com/SuperTool.aspx# using the urllib library.
Due to restrictions on my device, I cannot use any Python libraries other than the built-in ones, so I can't use the whois library. I can't add libraries to get around CAPTCHAs either. I can't use paid services, and I probably can't sign up for things. I cannot change these restrictions at the moment.
This is my current idea, but it doesn't appear to work with the structure of these particular sites:
import urllib.request

response = urllib.request.urlopen("url here" + domainIwantToQuery)
html = response.read()
html = html.decode("utf-8")
I want to search at least for any domain ending in ".com" or ".net"; all the common domain endings would be good.
Typically the domains I'm looking at have been created recently, if that information helps at all.
Could someone point me in the right direction of how to use these sites with Python using urllib, given my restrictions, and if it's even possible to do so?
Let me know if I need to add any more information or clarify something.

How to read an HTML page that takes some time to load? [duplicate]

I am trying to scrape a website using Python and Beautiful Soup. I've found that on some sites, image links that are visible in the browser cannot be seen in the page source. However, using Chrome Inspect or Fiddler, I can see the corresponding markup.
What I see in the source code is:
<div id="cntnt"></div>
But in Chrome Inspect, I can see a whole bunch of HTML/CSS generated within this div. Is there a way to load that generated content from Python as well? I am using plain urllib and I am able to get the source, but without the generated part.
I am not a web developer, hence I am not able to express the behaviour in better terms. Please feel free to ask for clarification if my question seems vague!
You need a JavaScript engine to parse and run the JavaScript code inside the page.
There are a bunch of headless browsers that can help you (see the sketch after this list):
http://code.google.com/p/spynner/
http://phantomjs.org/
http://zombie.labnotes.org/
http://github.com/ryanpetrello/python-zombie
http://jeanphix.me/Ghost.py/
http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/
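For illustration, a minimal sketch using Selenium with headless Chrome (not one of the libraries listed above; assumes the selenium package and a matching chromedriver are installed):
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
try:
    driver.get("http://example.com/page-with-generated-content")
    # page_source holds the DOM after the page's JavaScript has run.
    rendered_html = driver.page_source
finally:
    driver.quit()

print(rendered_html)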
The content of the website may be generated after load via JavaScript. To obtain the generated content from Python, refer to this answer.
A regular scraper gets just the HTML document. To get content generated by JavaScript logic, you instead need a headless browser that also builds the DOM and loads and runs the scripts the way a regular browser would. The Wikipedia article and some other pages on the net have lists of these and their capabilities.
Keep in mind when choosing that some previously major products in this space are now abandoned.
TRY THIS FIRST!
Perhaps the data really is generated inside the JavaScript itself, and all this JavaScript-engine business is needed. (Some GREAT links here!)
But from experience, my first guess is that the JS is pulling the data in via an AJAX request. If you can get your program to simulate that, you'll probably get everything you need handed right to you without any tedious parsing/executing/scraping involved!
It will take a little detective work, though. I suggest turning on your network traffic logger (such as "Web Developer Toolbar" in Firefox) and then visiting the site. Focus your attention on any/all XmlHTTPRequests. The data you need should be found somewhere in one of these responses, probably in the middle of some JSON text.
Now, see if you can re-create that request and get the data directly. (NOTE: You may have to set the User-Agent of your request so the server thinks you're a "real" web browser.)
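A minimal sketch of re-creating such a request with urllib (the endpoint URL and headers are placeholders you would copy from the network log):
import json
import urllib.request

# Placeholder endpoint; use the XmlHTTPRequest URL seen in your network log.
req = urllib.request.Request(
    "https://example.com/api/content?id=123",
    headers={
        "User-Agent": "Mozilla/5.0",           # look like a real browser
        "X-Requested-With": "XMLHttpRequest",   # some servers check for this
    },
)
with urllib.request.urlopen(req) as resp:
    data = json.loads(resp.read().decode("utf-8"))
print(data)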

Extracting and parsing HTML from a secure website with Python?

Let's dive into this, shall we?
OK, I need to write a script (I don't care what language; I'd prefer something like Python or JavaScript, but I'll take the time to learn whatever works). The script will access multiple URLs, extract text from each site, and store it in a folder on my PC. (From there I am manipulating the data with Python, which I know how to do.)
EDIT:
Currently I am using Python's NLTK module. Here is a simple version of my code:
from urllib.request import urlopen  # urllib2.urlopen on Python 2
import nltk

url = "<URL HERE>"
html = urlopen(url).read()
raw = nltk.clean_html(html)  # strips the tags, leaving plain text
print(raw)
This code works fine for both http and https, but not for instances where authentication is required.
Is there a Python module which deals with secure authentication?
Thanks in advance for help! And to the mods who will view this as a bad question, please just give me ways to make it better. I need ideas..from people, not Google.
Mechanize is one option; another is just plain urllib2.
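For example, a minimal sketch of HTTP Basic authentication with the standard library (urllib.request in Python 3; urllib2 has equivalent handlers in Python 2). The URL and credentials are placeholders:
import urllib.request

password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, "https://example.com/", "myuser", "mypassword")
auth_handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
opener = urllib.request.build_opener(auth_handler)

html = opener.open("https://example.com/protected/page").read()
print(html.decode("utf-8"))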

How to detect which websites the user is viewing or connecting to

I'm writing a Python application that, among other things, needs to know which websites the user is looking at in the web browser or otherwise connecting to on OS X and, if possible, Linux. This is to track how long the user is accessing certain websites.
I know that on OS X there is a Cocoa call which returns the current page in Safari, but this must also work with Chrome and Firefox at a minimum, ideally with any client, known or unknown to the software.
The first thing I've looked into is pcap via libpcap, which I can use in Python with pylibpcap. pcap is for packet capture, and in theory, as I understand it, I could detect whether packets are flowing to/from certain "black-listed" IP addresses. This would sort of work, but if a static web page were open in the browser and left as is, I would not be able to detect it via this mechanism.
First, will I even be able to do what I've described above with libpcap? I'm a beginner with network filtering and the like, so I'm not entirely sure.
Second, is there a better way to do this?
(The app TimeSink for OS X has an interesting approach, which is to look at what is displayed in the title bar to decide which website the user is browsing. This is not ideal for me for two reasons: (1) I may not be able to conclusively determine from the title which domain is being visited, and (2) I can only see the title of the active tab.)
Maybe use a Twisted proxy and pass all browser traffic through it?
You would then be able to analyse the HTTP headers and extract the relevant information.
Here is an example: https://github.com/nbareil/twisted-proxy
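A minimal sketch of such a logging proxy with Twisted (the class names are illustrative; the linked repository is a fuller example):
from twisted.web import proxy, http
from twisted.internet import reactor

class LoggingProxyRequest(proxy.ProxyRequest):
    def process(self):
        # Log which host the browser asked for before forwarding the request.
        print("Visiting:", self.getHeader(b"host"))
        proxy.ProxyRequest.process(self)

class LoggingProxy(proxy.Proxy):
    requestFactory = LoggingProxyRequest

class LoggingProxyFactory(http.HTTPFactory):
    protocol = LoggingProxy

# Point the browser's HTTP proxy setting at localhost:8080.
reactor.listenTCP(8080, LoggingProxyFactory())
reactor.run()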

Web scraping - how to identify main content on a webpage

Given a news article webpage (from any major news source such as times or bloomberg), I want to identify the main article content on that page and throw out the other misc elements such as ads, menus, sidebars, user comments.
What's a generic way of doing this that will work on most major news sites?
What are some good tools or libraries for data mining? (preferably python based)
There are a number of ways to do it, but none will always work. Here are the two easiest:
If it's a known, finite set of websites: in your scraper, convert each URL from the normal URL to the print URL for the given site (this cannot really be generalized across sites).
Use the arc90 readability algorithm (the reference implementation is in JavaScript): http://code.google.com/p/arc90labs-readability/. The short version of this algorithm is that it looks for divs with p tags within them. It will not work for some websites but is generally pretty good; see the sketch just below.
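For instance, a minimal sketch using readability-lxml, a Python port of that algorithm (package name and API as I recall them; check its documentation):
import urllib.request
from readability import Document  # pip install readability-lxml

html = urllib.request.urlopen("https://example.com/some-news-article").read()
doc = Document(html)
print(doc.title())    # detected article title
print(doc.summary())  # main article content as cleaned-up HTML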
A while ago I wrote a simple Python script for just this task. It uses a heuristic to group text blocks together based on their depth in the DOM. The group with the most text is then assumed to be the main content. It's not perfect, but works generally well for news sites, where the article is generally the biggest grouping of text, even if broken up into multiple div/p tags.
You'd use the script like: python webarticle2text.py <url>
There's no way to do this that's guaranteed to work, but one strategy you might use is to try to find the element with the most visible text inside of it.
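A rough sketch of that strategy with BeautifulSoup (grouping paragraph text by parent element is my own heuristic here, not a standard recipe):
from collections import defaultdict
from urllib.request import urlopen

from bs4 import BeautifulSoup

def main_text(url):
    soup = BeautifulSoup(urlopen(url).read(), "html.parser")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    # Add up the visible text of <p> tags per parent element.
    totals = defaultdict(int)
    parents = {}
    for p in soup.find_all("p"):
        key = id(p.parent)
        parents[key] = p.parent
        totals[key] += len(p.get_text(strip=True))
    if not totals:
        return soup.get_text(" ", strip=True)
    # The parent holding the most paragraph text is assumed to be the article.
    best = parents[max(totals, key=totals.get)]
    return best.get_text(" ", strip=True)

print(main_text("https://example.com/some-news-article")[:500])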
Diffbot offers a free API (10,000 URLs) to do that. I don't know if that approach is what you are looking for, but it might help someone: http://www.diffbot.com/
Check the following script. It is really amazing:
from newspaper import Article

URL = "https://www.ksat.com/money/philippines-stops-sending-workers-to-qatar"
article = Article(URL)

article.download()   # fetch the page
print(article.html)

article.parse()      # extract the main content and metadata
print(article.authors)
print(article.publish_date)
#print(article.text)
print(article.top_image)
print(article.movies)

article.nlp()        # compute keywords and a summary
print(article.keywords)
print(article.summary)
More documentation can be found at http://newspaper.readthedocs.io/en/latest/ and https://github.com/codelucas/newspaper. You should install it using:
pip3 install newspaper3k
For a solution in Java, have a look at https://github.com/kohlschutter/boilerpipe:
The boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.
The library already provides specific strategies for common tasks (for example: news article extraction) and may also be easily extended for individual problem settings.
But there is also a Python wrapper around this, available here:
https://github.com/misja/python-boilerpipe
It might be more useful to extract the RSS feeds (<link type="application/rss+xml" href="..."/>) on that page and parse the data in the feed to get the main content.
Another possibility for separating "real" content from noise is to measure the HTML density of the parts of an HTML page.
You will need a bit of experimentation with the thresholds to extract the "real" content, and I guess you could improve the algorithm by applying heuristics to specify the exact bounds of the HTML segment after having identified the interesting content.
Update: I just found out that the URL above does not work right now; here is an alternative link to a cached version on archive.org.
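As a very rough illustration of the idea (the metric below is my own simplification, not the algorithm from the article mentioned above):
from bs4 import BeautifulSoup

def text_density(element):
    # Ratio of visible text to raw markup; content-heavy blocks score higher.
    markup_len = len(str(element)) or 1
    return len(element.get_text(strip=True)) / float(markup_len)

html = open("page.html", encoding="utf-8").read()
soup = BeautifulSoup(html, "html.parser")
for div in sorted(soup.find_all("div"), key=text_density, reverse=True)[:5]:
    print(round(text_density(div), 3), div.get_text(" ", strip=True)[:80])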
There is a recent (early 2020) comparison of various methods of extracting the article body without ads, menus, sidebars, user comments, etc.; see https://github.com/scrapinghub/article-extraction-benchmark. A report, data, and evaluation scripts are available. It compares many options mentioned in the answers here, as well as some options which were not mentioned:
python-readability
boilerpipe
newspaper3k
dragnet
html-text
Diffbot
Scrapinghub AutoExtract
In short, "smart" open source libraries are adequate if you need to remove e.g. the sidebar and menu, but they don't handle removal of unnecessary content inside articles and are quite noisy overall; sometimes they remove the article itself and return nothing. Commercial services use computer vision and machine learning, which allows them to provide much more precise output.
For some use cases simpler libraries like html-text are preferable, both to commercial services and to "smart" open source libraries: they are fast, and they ensure information is not missing (i.e. recall is high).
I would not recommend copy-pasting code snippets, as there are many edge cases even for a seemingly simple task of extracting text from HTML, and there are libraries available (like html-text or html2text) which should be handling these edge cases.
To use a commercial tool, in general one needs to get an API key and then use a client library. For example, for AutoExtract by Scrapinghub (disclaimer: I work there) you would need to run pip install scrapinghub-autoextract. There is a Python API available (see the https://github.com/scrapinghub/scrapinghub-autoextract README for details), but an easy way to get extractions is to create a .txt file with URLs to extract and then run:
python -m autoextract urls.txt --page-type article --api-key <API_KEY> --output res.jl
I wouldn't try to scrape it from the web page - too many things could mess it up - but instead see which web sites publish RSS feeds. For example, the Guardian's RSS feed has most of the text from their leading articles:
http://feeds.guardian.co.uk/theguardian/rss
I don't know if The Times (The London Times, not NY) has one because it's behind a paywall. Good luck with that...
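A minimal sketch of reading such a feed with the third-party feedparser package (pip install feedparser):
import feedparser

feed = feedparser.parse("http://feeds.guardian.co.uk/theguardian/rss")
for entry in feed.entries[:5]:
    print(entry.title)
    # Many feeds put the article body, or a long excerpt, in the summary field.
    print(entry.summary[:200])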
