Python JSON fetching via Tor SOCKS5 proxy

I am trying to fetch JSON data with a Python 3 script on Tails. I would like to know whether this code is secure and doesn't leak my IP address or anything else. I know that Tails is configured to block problematic connections, so I would like to know if my code is safe.
import json
import requests

url = 'https://api.bitcoincharts.com/v1/markets.json'
proxy = {'https': 'socks5://127.0.0.1:9050'}

with open('datafile', 'w') as outfile:
    json.dump(requests.get(url, proxies=proxy).json(), outfile)
As you can see, I am using the requests library, which has been suggested for proxy work. I use SOCKS5 just as the docs suggest, pointed at localhost port 9050, where Tor listens.
I guess that if the site were plain HTTP, I would have to add an 'http' entry to the proxy dict as well.
One thing I am not sure about is whether to use port 9150 or 9050. The code seems to work with both, but I don't know which one is safer.
Other than these, is my code safe to use on Tails?
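For reference, a minimal variant of the snippet above. It assumes the requests[socks] extra (PySocks) is installed; per the requests documentation, the socks5h:// scheme resolves DNS through the proxy rather than locally, which is the usual recommendation for Tor:

import json
import requests

url = 'https://api.bitcoincharts.com/v1/markets.json'

# socks5h:// (note the "h") resolves hostnames through the proxy,
# so DNS lookups also go over Tor instead of happening locally.
proxies = {
    'http': 'socks5h://127.0.0.1:9050',
    'https': 'socks5h://127.0.0.1:9050',
}

response = requests.get(url, proxies=proxies, timeout=60)
response.raise_for_status()

with open('datafile', 'w') as outfile:
    json.dump(response.json(), outfile)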

Related

Python Requests - Get Server IP

I'm making a small tool that tests CDN performance, and I would like to check where each response comes from. I thought of getting the host's IP and then using one of the geolocation APIs on GitHub to check the country.
I've tried doing so with
import socket
...
raw._fp.fp._sock.getpeername()
...however, that only works when I use stream=True for the request, and that in turn breaks the tool's functionality.
Is there any other option to get the server ip with requests or in a completely different way?
The socket.gethostbyname() function from Python's socket library should solve your problem. You can check it out in the Python docs.
Here is an example of how to use it:
import socket
url="cdnjs.cloudflare.com"
print("IP:",socket.gethostbyname(url))
All you need to do is pass the hostname to socket.gethostbyname() and it will do the rest. Just make sure to remove the http:// prefix from the URL first, because that will trip it up.
I could not get Akilan's solution to give the IP address of the host I was using. socket.gethostbyname() and getpeername() were not working for me; they were not even available. His solution did open the door, though.
However, navigating the socket object, I did find this:
socket.getaddrinfo('host name', 443)[0][4][0]
I wrapped this in a try/except block.
Maybe there is a prettier way.
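A sketch of that try/except wrapper (resolve_ip is just an illustrative name):

import socket

def resolve_ip(hostname, port=443):
    """Return the first IP address that hostname resolves to, or None."""
    try:
        # getaddrinfo() returns (family, type, proto, canonname, sockaddr)
        # tuples; sockaddr[0] is the IP address string.
        return socket.getaddrinfo(hostname, port)[0][4][0]
    except socket.gaierror:
        return None

print(resolve_ip('cdnjs.cloudflare.com'))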

Proxy detection in Python3

The script below works fine when I am using it at home (same PC!):
import urllib.request
x = urllib.request.urlopen('https://www.google.com/')
print(x.read())
The same script does not work when I am connected at work. I do not know the proxy address or IP, so my script should pick up the proxy the same way IE or anything else on this PC does.
I found some suggestions about using a proxy, but the point is that I do not know the proxy IP or details. When I move the script to another PC it might have a different proxy, so I think hardcoding it is not a good approach.
Can I somehow inform Python to autodetect the proxy settings?
Going by your example, I am assuming you are doing an HTTPS call over a proxy. The urllib documentation hints that this is not supported, so instead you may have to settle for HTTP.
In order to validate that there is nothing wrong with your setup, you may try to open the IP directly:
import urllib
# IP address for `http://www.google.com` is `216.58.205.196`
x = urllib.urlopen('http://216.58.205.196')
print x.read()
A. There are lots of complaints in various other threads about Python's trippy auto-detection of proxy settings. I had this issue only once, years ago, and I opted for setting a fixed proxy instead of trying to configure auto-detection. To find your proxy, you can go to the Chrome URL chrome://net-internals/#proxy or run netstat -an | grep EST.
B. Once you have the proxy address, you can use the following code:
import urllib

# IP address for `http://www.google.com` is `216.58.205.196`
x = urllib.urlopen('http://216.58.205.196',
                   proxies={'http': 'http://www.someproxy.com:3128'})
print x.read()
If you cannot avoid HTTPS, then you may consider the requests library. I didn't test this, but the requests documentation looks quite promising. This is how it can be done:
import requests

proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}
requests.get('https://example.org', proxies=proxies)
Edit:
1: You may need to set up proxy authentication in order for B. to work.
2: For special characters, you would need to have the password in Unicode: 'p#ssw0rd'.decode('utf-8')
Hope this helps!
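As a side note on the auto-detection part of the question (the snippets above are Python 2): in Python 3, urllib.request can read the system proxy settings itself. A minimal sketch; ProxyHandler() with no arguments falls back to urllib.request.getproxies(), which reads environment variables and, on Windows, the registry:

import urllib.request

# Show the proxy settings Python detects from the environment
# (and the Windows registry / macOS system config, where applicable).
print(urllib.request.getproxies())

# ProxyHandler() with no arguments uses those detected settings.
opener = urllib.request.build_opener(urllib.request.ProxyHandler())
with opener.open('http://www.google.com/') as response:
    print(response.status)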

Python - Socket error

My code:
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("www.python.org", 80))
s.sendall(b"GET https://www.python.org HTTP/1.0\n\n")
print(s.recv(4096))
s.close()
Why does the output show me this?
b'HTTP/1.1 500 Domain Not Found\r\nServer: Varnish\r\nRetry-After: 0\r\ncontent-type: text/html\r\nCache-Control: private, no-cache\r\nconnection: keep-alive\r\nContent-Length: 179\r\nAccept-Ranges: bytes\r\nDate: Tue, 11 Jul 2017 15:23:55 GMT\r\nVia: 1.1 varnish\r\nConnection: close\r\n\r\n\n\n\nFastly error: unknown domain \n\n\nFastly error: unknown domain: . Please check that this domain has been added to a service.'
How can I fix it?
This is wrong on multiple levels:
- To access an HTTPS resource you need to create a TLS connection (i.e. wrap the existing TCP connection in TLS via the ssl module, with proper certificate checking etc.) and then send the HTTP request. Of course, the TCP connection in this case should go to port 443 (https), not 80 (http).
- The HTTP request should only contain the path, not the full URL.
- The line end must be \r\n, not \n.
- You had better send a Host header too, since many servers require it.
And that's only the request. Properly handling the response is a different topic.
I really, really recommend using an existing library like requests. HTTP(S) is considerably more complex than most people who have only looked at a few traffic captures think.
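For completeness, a minimal sketch of what a correct raw-socket HTTPS request looks like, using the standard library's ssl module (ssl.create_default_context() handles certificate and hostname checking):

import socket
import ssl

host = 'www.python.org'

context = ssl.create_default_context()
with socket.create_connection((host, 443)) as tcp:
    # Wrap the TCP connection in TLS; server_hostname enables SNI and
    # hostname verification.
    with context.wrap_socket(tcp, server_hostname=host) as tls:
        # Path only (not the full URL), CRLF line endings, and a Host header.
        request = 'GET / HTTP/1.0\r\nHost: {}\r\n\r\n'.format(host)
        tls.sendall(request.encode('ascii'))
        print(tls.recv(4096))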
import requests
x = requests.get('https://www.python.org')
print x.text
With the requests library, HTTPS requests are very simple! If you're doing this with raw sockets, you have to do a lot more work to negotiate a cipher, etc. Try the above code (Python 2.7).
I would also note that, in my experience, Python is excellent for getting things done quickly. If you are learning about networking and cryptography, try writing an HTTPS client on your own using sockets. If you want to automate something quickly, use the tools that are available to you. I almost always use requests for this type of task. As an additional note, if you're interested in parsing HTML content, check out the PyQuery library. I've used it to automate interaction with many web services.
Requests
PyQuery
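For instance, a quick sketch of the PyQuery usage mentioned above (assuming the pyquery package is installed):

from pyquery import PyQuery as pq

# pq(url=...) fetches the page and parses it in one step.
doc = pq(url='https://www.python.org')
print(doc('title').text())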

Failure to change identity using Tor when scraping Google

I am trying to automate Google searches, but unfortunately my IP has been blocked. After some searching, it seems that using Tor could get me a new IP dynamically. However, after adding the following code block to my existing code, Google still blocks my attempts, even under the new IP. So I am wondering: is there anything wrong with my code?
Code (based on an example I found):
from TorCtl import TorCtl
import socks
import socket
import urllib2

socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 9050)
__originalSocket = socket.socket

def newId():
    ''' Clean circuit switcher.
    Restores socket to the original system value,
    calls the Tor control socket and closes it,
    then replaces the system socket with the socksified socket.
    '''
    socket.socket = __originalSocket
    conn = TorCtl.connect(controlAddr="127.0.0.1", controlPort=9051, passphrase="mypassword")
    TorCtl.Connection.send_signal(conn, "NEWNYM")
    conn.close()
    socket.socket = socks.socksocket

## generate a new ip
newId()

### verify the new ip
print(urllib2.urlopen("http://icanhazip.com/").read())

## run my scrape code
google_scrape()
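(Aside: TorCtl is long unmaintained; with its successor, stem, the same NEWNYM request looks roughly like this - a sketch, assuming the control port and password are configured as above:)

from stem import Signal
from stem.control import Controller

# Connect to Tor's control port and request a new circuit (NEWNYM).
with Controller.from_port(port=9051) as controller:
    controller.authenticate(password='mypassword')
    controller.signal(Signal.NEWNYM)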
New error message:
Sometimes you may be asked to solve the CAPTCHA if you are using advanced terms that robots are known to use, or sending requests very quickly.
IP address: 89.234.XX.25X
Time: 2017-02-12T05:02:53Z
Google (and many other sites, such as those "protected" by Cloudflare) filters requests coming via Tor by the IP addresses of Tor exit nodes. It can do this because the list of Tor exit node IP addresses is public.
Thus changing your identity - which changes your Tor circuit and will likely result in a different exit node and thus a different IP (although the latter two are not guaranteed) - will not get you past this block.
For your use case, you might consider using a VPN instead of Tor, as VPN IP addresses are less likely to be blocked, especially if you use a non-free VPN.
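To see the public-list problem in action, you can check an address against the exit list yourself; a sketch (the bulk exit list URL below is an assumption worth verifying against current Tor Project documentation):

import requests

# The Tor Project publishes the IPs of current exit nodes; this URL is
# an assumption based on the project's bulk exit list service.
EXIT_LIST_URL = 'https://check.torproject.org/torbulkexitlist'

exit_ips = set(requests.get(EXIT_LIST_URL, timeout=30).text.split())
print('89.234.0.1' in exit_ips)  # True if that IP is a known exit node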

Using IP authenticated proxies in a distributed crawler

I'm working on a distributed web crawler in Python, running on a cluster of CentOS 6.3 servers. The crawler uses many proxies from different proxy providers. Everything works like a charm for username/password-authenticated proxy providers. But now we have bought some proxies that use IP-based authentication, meaning that when I want to crawl a webpage through one of these proxies, I need to make the request from a subset of our servers.
The question is: is there a way in Python (using a library/software) to make a request to a domain passing through 2 proxies? (One proxy is one of the subset needed for the IP authentication and the second is the actual proxy from the provider.) Or is there another way to do this without setting up this subset of our servers as proxies?
The code I'm using now to make the request through a proxy uses the requests library:
import requests
from requests.auth import HTTPProxyAuth

proxy_obj = {
    'http': proxy['ip']
}
auth = HTTPProxyAuth(proxy['username'], proxy['password'])
data = requests.get(url, proxies=proxy_obj, auth=auth)
Thanks in advance!
"is there a way in Python (using a library/software) to make a request to a domain passing through 2 proxies?"
If you need to go through two proxies, it looks like you'll have to use HTTP tunneling: any host which isn't on the authorized list would have to connect to an HTTP proxy server on one of the hosts which is, and use the HTTP CONNECT method to create a tunnel to the remote proxy. It may not be possible to achieve that with the requests library, though.
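A bare-bones sketch of that chaining idea with raw sockets (the proxy addresses are hypothetical placeholders):

import socket

# Hypothetical addresses: PROXY1 runs on an IP-authorized host, PROXY2 is
# the provider's proxy that checks the caller's IP.
PROXY1 = ('10.0.0.5', 3128)
PROXY2 = ('proxy.provider.example', 8080)

# Connect to the first proxy and ask it to tunnel to the second proxy
# with an HTTP CONNECT request.
sock = socket.create_connection(PROXY1)
connect_req = 'CONNECT {0}:{1} HTTP/1.1\r\nHost: {0}:{1}\r\n\r\n'.format(*PROXY2)
sock.sendall(connect_req.encode('ascii'))

# A "200 Connection established" reply means the tunnel is up; from here
# on, bytes written to sock reach PROXY2, which can then be spoken to as
# a normal HTTP proxy (e.g. a GET with an absolute URL).
print(sock.recv(4096))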
"Or is there another way to do this without setting up this subset of our servers as proxies?"
Assuming that the remote proxies which use IP-address-based authentication all expect the same IP address, you could instead configure a NAT router, between your cluster and the remote proxies, to translate all outbound HTTP requests so that they come from that single IP address.
But, before you look into implementing either of these unnecessarily complicated options, and given that you're paying for this service, can't you just ask the provider to allow requests for the entire range of IP addresses which you're currently using?
