Python3: http.client with privoxy/TOR making bad requests - python

I'm trying to use TOR with http.client.HTTPConnection, but for some reason I keep getting weird responses from everything. I'm not really sure exactly how to explain, so here's an example of what I have:
class Socket(http.client.HTTPConnection):
    def __init__(self, url):
        super().__init__('127.0.0.1', 8118)
        super().set_tunnel(url)
        #super().__init__(url)

    def get(self, url = '/', params = {}):
        params = util.params_to_query(params)
        if params:
            if url.find('?') == -1: url += '?' + params
            else: url += '&' + params
        self.request(
            'GET',
            url,
            '',
            {'Connection': 'Keep alive'}
        )
        return self.getresponse().read().decode('utf-8')
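(util.params_to_query is the asker's own helper; the standard library's urllib.parse.urlencode does the same job and handles percent-encoding. A sketch of what the query-building part of get() amounts to:)

```python
from urllib.parse import urlencode

def build_url(url='/', params=None):
    # Encode the params dict as a query string (percent-encoding included).
    query = urlencode(params or {})
    if not query:
        return url
    # Append with '?' unless the URL already carries a query string.
    sep = '&' if '?' in url else '?'
    return url + sep + query

print(build_url('/search', {'q': 'tor project'}))  # /search?q=tor+project
```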
If I run this with:
sock = Socket('www.google.com')
print(sock.get())
I get:
<html><head><meta content="text/html;charset=utf-8" http-equiv="content-type"/>
<title>301 Moved</title></head><body>
<h1>301 Moved</h1>
The document has moved
here.
</body></html>
Google is redirecting me to the url I just requested, except with the privoxy port. And it gets weirder - if I try https://check.torproject.org:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<title>Welcome to sergii!</title>
</head>
<body>
<h1>Welcome to sergii!</h1>
This is sergii, a system run by and for the Tor Project.
She does stuff.
What kind of stuff and who our kind sponsors are you might learn on
db.torproject.org.
<p>
</p><hr noshade=""/>
<font size="-1">torproject-admin</font>
</body>
</html>
If I don't try to use privoxy/TOR, I get exactly what your browser gets at http://www.google.com or http://check.torproject.org. I don't know what's going on here. I suspect the issue is with python because I can use TOR with firefox, but I don't really know.
Privoxy log reads:
2015-06-27 19:28:26.950 7f58f4ff9700 Request: www.google.com:80/
2015-06-27 19:30:40.360 7f58f4ff9700 Request: check.torproject.org:80/
TOR log has nothing useful to say.

This ended up being because I was connecting with http:// when those sites wanted https://. It does work correctly for sites that accept plain http://.
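For reference, a minimal sketch of tunneling HTTPS through a local proxy with http.client, assuming privoxy is listening on 127.0.0.1:8118 as in the question. set_tunnel() only records the CONNECT target; nothing is sent until request() is called:

```python
import http.client

# Tunnel HTTPS through a local proxy (e.g. privoxy) via HTTP CONNECT.
conn = http.client.HTTPSConnection('127.0.0.1', 8118)
conn.set_tunnel('check.torproject.org', 443)

# Uncomment to actually send the request through the proxy:
# conn.request('GET', '/')
# print(conn.getresponse().status)
```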

Related

What is the correct 3-legged OAuth workflow in Python (Example: ImmoScout API)? (How to get request_token)

I am trying to access the ImmoScout24 web API for a data science project in Python and I'm kind of stuck in the 3-legged authentication process. I googled the problem, but it's kind of special, so maybe someone can help me.
I want to implement the workflow described on: https://api.immobilienscout24.de/api-docs/authentication/three-legged/#callback-url
To obtain the request_token (first step within the authentication process) I tried the following approach:
API Credentials are stored in those two variables:
client_key
client_secret
The Python code looks as follows (OAuth1Session comes from the requests_oauthlib package):
from requests_oauthlib import OAuth1Session

immoscout_api = OAuth1Session(client_key,
                              client_secret=client_secret)
request_token_url = 'http://rest.immobilienscout24.de/restapi/security/oauth/request_token'
fetch_response = immoscout_api.fetch_request_token(request_token_url)
I am getting an Error in my Jupyter Notebook that looks like the following:
TokenRequestDenied: Token request failed with code 403, response was '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<HTML><HEAD><META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<TITLE>ERROR: The request could not be satisfied</TITLE>
</HEAD><BODY>
<H1>403 ERROR</H1>
<H2>The request could not be satisfied.</H2>
<HR noshade size="1px">
Bad request.
We can't connect to the server for this app or website at this time. There might be too much traffic or a configuration error. Try again later, or contact the app or website owner.
<BR clear="all">
If you provide content to customers through CloudFront, you can find steps to troubleshoot and help prevent this error by reviewing the CloudFront documentation.
<BR clear="all">
<HR noshade size="1px">
<PRE>
Generated by cloudfront (CloudFront)
Request ID: M_HHRf9VaNN9xFRqWlHWt2txfuIsBE5fe6siJACFUFjVWw20p91jLg==
</PRE>
<ADDRESS>
</ADDRESS>
</BODY></HTML>'.
Can somebody help me to obtain the request token?
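Given that the 403 page is generated by CloudFront rather than the API itself, the https:// variant of the request-token URL is worth trying. For debugging, the HMAC-SHA1 signature that OAuth1Session computes under the hood can also be reproduced with the standard library alone; this is a sketch of the RFC 5849 signing step (nonce and timestamp are fixed for reproducibility, and the credential values are placeholders):

```python
import base64
import hashlib
import hmac
from urllib.parse import quote

def oauth1_signature(method, url, params, consumer_secret, token_secret=''):
    """RFC 5849 signature base string + HMAC-SHA1, as OAuth1Session does."""
    enc = lambda s: quote(str(s), safe='')
    # Normalized parameters: percent-encoded pairs, sorted, joined with '&'.
    norm = '&'.join('{}={}'.format(enc(k), enc(v)) for k, v in sorted(params.items()))
    base = '&'.join([method.upper(), enc(url), enc(norm)])
    key = '{}&{}'.format(enc(consumer_secret), enc(token_secret)).encode()
    return base64.b64encode(hmac.new(key, base.encode(), hashlib.sha1).digest()).decode()

oauth_params = {
    'oauth_consumer_key': 'client_key',        # placeholder credentials
    'oauth_nonce': 'abc123',                   # fixed for reproducibility
    'oauth_signature_method': 'HMAC-SHA1',
    'oauth_timestamp': '1588000000',
    'oauth_version': '1.0',
}
sig = oauth1_signature(
    'POST',
    'https://rest.immobilienscout24.de/restapi/security/oauth/request_token',
    oauth_params,
    'client_secret')
print(sig)
```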

Non-HTTP URL (used to call a phone's SMS app) only intermittently successful on the same device, what could be the cause?

I'm calling a non-http url from my Django app. The URL is sms:14085701890?body:Hello and visit my website at example.com. This is actually an HTML trick: calling that opens the native SMS app of the mobile device, and prefills 1-408-570-1890 in the to field, and Hello and visit my website at example.com in the body field of the SMS.
I tried sending this response via a view, but all non-HTTP responses are considered unsafe, and overriding this behavior has cost me a lot of time without a solution.
So instead, currently the view just renders a response to a template that has the following code:
<html>
<head>
<meta http-equiv="Cache-Control" content="max-age=0">
<meta http-equiv="Cache-Control" content="no-cache, no-store, must-revalidate">
<meta http-equiv="Pragma" content="no-cache">
<meta http-equiv="Expires" content="0">
<meta http-equiv="refresh" content="0;URL={{ url }}">
</head>
<body>
<b>Please wait...</b><br>
</body>
</html>
Using meta http-equiv="refresh" in head automatically redirects me to the SMS app with {{ url }} prefilled (where url = sms:<number>?body:<body>). This set up gets the job done for me, but there's a problem.
The problem: After my SMS app gets pre-filled correctly, I just go back and try the whole thing all over again, and what I notice is that the second time around, the body text in the SMS turns up completely blank. If I keep re-trying, I have a ~50% failure rate (body pre-fills correctly ~50% of the time, otherwise it's just blank - though the number always correctly pre-fills).
Strange part is, if I print the {{ url }} in the template above, the url gets correctly printed every time, meaning the body text was formed correctly by my code every time.
What on earth could be going on? I'm out of leads. Perhaps someone can creatively guesstimate what might be going on, and what can I do to fix this. Let me know if you need more info.
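One thing worth ruling out (a guess, not a confirmed cause): the body in the URL above is not percent-encoded, and per RFC 5724 an sms: URI carries the text as ?body= with percent-encoded content. A sketch with urllib.parse.quote, using the number and message from the question:

```python
from urllib.parse import quote

def sms_url(number, body):
    # RFC 5724: sms:<number>?body=<percent-encoded text>
    return 'sms:{}?body={}'.format(number, quote(body))

print(sms_url('14085701890', 'Hello and visit my website at example.com'))
# sms:14085701890?body=Hello%20and%20visit%20my%20website%20at%20example.com
```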

How to handle redirects in url opener

Sorry for the rookie question. I was wondering if there is an efficient URL opener class in Python that handles redirects. I'm currently using a simple urllib.urlopen(), but it's not working. This is an example:
http://thetechshowdown.com/Redirect4.php
For this url, the class I'm using does not follow the redirection to:
http://www.bhphotovideo.com/
and only shows:
"You are being automatically redirected to B&H.
Page Stuck? Click Here ."
Thanks in advance.
Use the requests module - it follows redirects by default.
But a page can also be redirected by JavaScript, which no module will follow.
Turn off JavaScript in your browser and go to http://thetechshowdown.com/Redirect4.php to see whether it still redirects you to the other page.
I checked this page - there is a JavaScript redirect and an HTML redirect (a meta tag with a "refresh" attribute). Neither is a normal redirect sent by the server, so no module will follow them. You have to read the page, find the URL in the code and connect to that URL.
import requests
import lxml, lxml.html

# starting page
r = requests.get('http://thetechshowdown.com/Redirect4.php')
#print r.url
#print r.history
#print r.text

# first redirection
html = lxml.html.fromstring(r.text)
refresh = html.cssselect('meta[http-equiv="refresh"]')
if refresh:
    print 'refresh:', refresh[0].attrib['content']
    x = refresh[0].attrib['content'].find('http')
    url = refresh[0].attrib['content'][x:]
    print 'url:', url
    r = requests.get(url)
    #print r.text

# second redirection
html = lxml.html.fromstring(r.text)
refresh = html.cssselect('meta[http-equiv="refresh"]')
if refresh:
    print 'refresh:', refresh[0].attrib['content']
    x = refresh[0].attrib['content'].find('http')
    url = refresh[0].attrib['content'][x:]
    print 'url:', url
    r = requests.get(url)

# final page
print r.text
That happens because of soft redirects. urllib is not following the redirects because it does not recognize them as such. In fact, an HTTP response code 200 (page found) is issued, and the redirection happens as a side effect in browsers.
The first page has an HTTP response code 200, but contains the following:
<meta http-equiv="refresh" content="1; url=http://fave.co/1idiTuz">
which instructs the browser to follow the link. The second resource issues an HTTP response code 301 or 302 (redirect) to another resource, where a second soft redirect takes place, this time with JavaScript:
<script type="text/javascript">
  setTimeout(function () {window.location.replace('http://bhphotovideo.com');}, 2.75 * 1000);
</script>
<noscript>
  <meta http-equiv="refresh" content="2.75;http://bhphotovideo.com" />
</noscript>
Unfortunately, you will have to extract the URLs to follow by hand. However, it's not difficult. Here is the code:
from lxml.html import parse
from urllib import urlopen
from contextlib import closing

def follow(url):
    """Follow both true and soft redirects."""
    while True:
        with closing(urlopen(url)) as stream:
            next = parse(stream).xpath("//meta[@http-equiv = 'refresh']/@content")
            if next:
                url = next[0].split(";")[1].strip().replace("url=", "")
            else:
                return stream.geturl()

print follow("http://thetechshowdown.com/Redirect4.php")
I will leave the error handling to you :) Also note that this might result in an endless loop if the target page contains a <meta> refresh tag too. It is not your case, but you could add some sort of check to prevent that: stop after n redirects, see if the page redirects to itself, whichever you think is better.
You will probably need to install the lxml library.
The meta refresh redirect URLs found in HTML can look like any of these:
Relative urls:
<meta http-equiv="refresh" content="0; url=legal_notices_en.htm#disclaimer">
With quotes inside quotes:
<meta http-equiv="refresh" content="0; url='legal_notices_en.htm#disclaimer'">
Uppercase letters in the content of the tag:
<meta http-equiv="refresh" content="0; URL=legal_notices_en.htm#disclaimer">
Summary:
Use lxml.html to parse the HTML,
Use a lower() and two split()s to get the url part,
Strip eventual wrapping quotes and spaces,
Get the absolute url,
Store the cache of the results in a local file with shelves (useful if you have lots of urls to test).
Usage:
print get_redirections('https://www.google.com')
Returns something like:
{'final': u'https://www.google.be/?gfe_rd=fd&ei=FDDASaSADFASd', 'history': [<Response [302]>]}
Code:
from urlparse import urljoin, urlparse
import urllib, shelve, lxml, requests
from lxml import html

def get_redirections(initial_url, url_id = None):
    if not url_id:
        url_id = initial_url
    documents_checked = shelve.open('tested_urls.log')
    if url_id in documents_checked:
        print 'cached'
        output = documents_checked[url_id]
    else:
        print 'not cached'
        redirecting = True
        history = []
        try:
            current_url = initial_url
            while redirecting:
                r = requests.get(current_url)
                final = r.url
                history += r.history
                status = {'final': final, 'history': history}
                html = lxml.html.fromstring(r.text.encode('utf8'))
                refresh = html.cssselect('meta[http-equiv="refresh"]')
                if refresh:
                    refresh_content = refresh[0].attrib['content']
                    current_url = refresh_content.lower().split('url=')[1].split(';')[0]
                    before_stripping = ''
                    after_stripping = current_url
                    while before_stripping != after_stripping:
                        before_stripping = after_stripping
                        after_stripping = before_stripping.strip('"').strip("'").strip()
                    current_url = urljoin(final, after_stripping)
                    history += [current_url]
                else:
                    redirecting = False
        except requests.exceptions.RequestException as e:
            status = {'final': str(e), 'history': [], 'error': e}
        documents_checked[url_id] = status
        output = status
    documents_checked.close()
    return output

url = 'http://google.com'
print get_redirections(url)

Python Authentication

I'm new to Python, and after struggling with it a little bit I almost got the code working.
import urllib, urllib2, cookielib
username = 'myuser'
password = 'mypass'
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
login = urllib.urlencode({'user' : username, 'pass' : password})
opener.open('http://www.ok.com/', login)
mailb = opener.open('http://www.ok.com/mailbox').read()
print mailb
But the output I got after print is just a redirect page.
<html>
<head>
<META HTTP-EQUIV="Refresh" CONTENT="0;URL=https://login.ok.com/login.html?skin=login-page&dest=REDIR|http://www.ok.com/mailbox">
<HTML dir=ltr><HEAD><TITLE>OK :: Redirecting</TITLE>
</head>
</html>
Thanks
If a browser got that response, it would interpret it as a request to redirect to the URL specified.
You will need to do something similar with your script. You need to parse the <META> tag and locate the URL and then do a GET on that URL.
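A sketch of that parsing step using the standard library's html.parser, fed the redirect page from the question (the url= extraction matches the CONTENT="0;URL=..." format shown above):

```python
from html.parser import HTMLParser

class MetaRefreshParser(HTMLParser):
    """Collect the URL from a <meta http-equiv="refresh"> tag, if any."""
    def __init__(self):
        super().__init__()
        self.refresh_url = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # attribute names arrive lowercased
        if tag == 'meta' and attrs.get('http-equiv', '').lower() == 'refresh':
            content = attrs.get('content', '')
            if 'url=' in content.lower():
                i = content.lower().index('url=')
                self.refresh_url = content[i + len('url='):].strip()

page = '''<html><head>
<META HTTP-EQUIV="Refresh" CONTENT="0;URL=https://login.ok.com/login.html?skin=login-page&dest=REDIR|http://www.ok.com/mailbox">
</head></html>'''

parser = MetaRefreshParser()
parser.feed(page)
print(parser.refresh_url)
```

From there the script would issue a GET on parser.refresh_url (with the cookie jar still attached) to follow the redirect.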

Cannot fetch a web site with python urllib.urlopen() or any web browser other than Shiretoko

Here is the URL of the site I want to fetch
https://salami.parc.com/spartag/GetRepository?friend=jmankoff&keywords=antibiotic&option=jmankoff%27s+tags
When I fetch the web site with the following code and display the contents with the following code:
sock = urllib.urlopen("https://salami.parc.com/spartag/GetRepository?friend=jmankoff&keywords=antibiotic&option=jmankoff's+tags")
html = sock.read()
sock.close()
soup = BeautifulSoup(html)
print soup.prettify()
I get the following output:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<title>
Error message
</title>
</head>
<body>
<h2>
Invalid input data
</h2>
</body>
</html>
I get the same result with urllib2 as well. Now interestingly, this URL works only in the Shiretoko web browser v3.5.7 (when I say it works, I mean that it brings me the right page). When I feed this URL into Firefox 3.0.15 or Konqueror v4.2.2, I get exactly the same error page (with "Invalid input data"). I don't have any idea what creates this difference or how I can fetch this page using Python. Any ideas?
Thanks
If you look at the urllib2 docs, they say:
urllib2.build_opener([handler, ...])
.....
If the Python installation has SSL support (i.e., if the ssl module can be imported), HTTPSHandler will also be added.
.....
You can try using urllib2 together with the ssl module. Alternatively, you can use httplib.
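A quick way to confirm the precondition those docs mention, i.e. that the interpreter was built with SSL support (if the import below succeeds, an HTTPS handler is available to the opener):

```python
import ssl

# If this import works, openers can handle https:// URLs;
# a default context also verifies server certificates.
ctx = ssl.create_default_context()
print(ssl.OPENSSL_VERSION)
print(ctx.verify_mode == ssl.CERT_REQUIRED)
```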
That's exactly what you get when you click on the link with a web browser. Maybe you are supposed to be logged in or have a cookie set or something.
I get the same message with Firefox 3.5.8 (Shiretoko) on Linux.
