Check Tor connection is established before running Scrapy - Python

I'd like to check Tor before I start crawling with Python Scrapy. I am using Polipo/Tor/Scrapy on Linux.
With this setup Scrapy correctly uses Tor for its crawls. The way I check whether Scrapy is using Tor correctly is to crawl this page in my spider:
import logging
import scrapy

class MySpider(scrapy.Spider):
    name = "check_tor"  # Scrapy spiders require a name

    def start_requests(self):
        yield scrapy.Request('https://check.torproject.org/', self.parse)

    def parse(self, response):
        logging.info("Check tor page: " + str(response.css('.content h1::text')))
However, I think there might be a better/cleaner way of doing it. I know I can check the Tor service status or check my IP address, but I want to actually check whether the Tor connection is correctly established.

A somewhat definitive way to do this is to connect to Tor's control port and issue GETINFO status/circuit-established.
If Tor has an active circuit built, it will return:
250-status/circuit-established=1
250 OK
If Tor hasn't been used for a while, this could be 0. You can also call GETINFO dormant, which would yield 250-dormant=1. Most likely, when you then try to use Tor, it will build a circuit, dormant will become 0, and circuit-established will be 1, barring any major network issues.
In either case, dormant=0 or circuit-established=1 should be enough to tell you that you can use Tor.
It's a simple text protocol, so you can just open a socket to the control port, authenticate, and issue commands, or use the Controller class from Stem.
See the control spec for more info.
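For example, a minimal sketch with Stem, assuming the ControlPort is 9051 and cookie authentication (or none) is configured; adjust the port and authentication to match your torrc:

from stem.control import Controller

def tor_circuit_established(control_port=9051):
    # Ask Tor's control port whether a circuit is currently built.
    with Controller.from_port(port=control_port) as controller:
        controller.authenticate()  # uses the auth cookie if one is configured
        return controller.get_info('status/circuit-established') == '1'

if tor_circuit_established():
    print("Tor circuit established; safe to start crawling.")
else:
    print("No circuit yet; Tor may be dormant or still bootstrapping.")

You could call a check like this before starting the Scrapy crawl, instead of burning a request on check.torproject.org.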

Related

Selenium Python, reset tcp connection after each request

So I'm using Python and Selenium to repeat requests to a specific website. I am using a rotating proxy that is supposed to give me a new IP after each request. The issue is that when I make a request, for example to whatsmyip.org, in the Chrome window, I don't always get a fresh IP.
If my requests are made every 2 seconds I keep the same IP, but if they are made every 10-15 seconds then my IP changes.
If you have any idea how I could fix this it would be nice; for example a Chrome option? Or maybe a capability? I really don't know.

Failure to change identity using Tor when scraping Google

I am trying to automate Google search, but unfortunately my IP is blocked. After some searching, it seems like using Tor could get me a new IP dynamically. However, after adding the following code block to my existing code, Google still blocks my attempts even under the new IP. So I am wondering, is there anything wrong with my code?
Code (based on this)
from TorCtl import TorCtl
import socks
import socket
import urllib2

socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 9050)
__originalSocket = socket.socket

def newId():
    ''' Clean circuit switcher

    Restores socket to original system value.
    Calls TOR control socket and closes it.
    Replaces system socket with socksified socket.
    '''
    socket.socket = __originalSocket
    conn = TorCtl.connect(controlAddr="127.0.0.1", controlPort=9051, passphrase="mypassword")
    TorCtl.Connection.send_signal(conn, "NEWNYM")
    conn.close()
    socket.socket = socks.socksocket

## generate a new ip
newId()

## verify the new ip
print(urllib2.urlopen("http://icanhazip.com/").read())

## run my scrape code
google_scrape()
New error message:
Sometimes you may be asked to solve the CAPTCHA if you are using advanced terms that robots are known to use, or sending requests very quickly.
IP address: 89.234.XX.25X
Time: 2017-02-12T05:02:53Z
Google (and many other sites, such as those "protected" by Cloudflare) filter requests coming via Tor by the IP addresses of Tor exit nodes. They can do this because the list of Tor exit node IP addresses is public.
Thus changing your identity - which changes your Tor circuit and will likely result in a different exit node and thus a different IP (although neither is guaranteed) - will not work against this block.
For your use case you might consider using a VPN instead of Tor, as VPN IP addresses are less likely to be blocked, especially if you use a non-free VPN.
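As an illustration of how simple that filtering is, here is a small sketch (Python 3 assumed) that downloads the Tor Project's public bulk exit list - the same data a site can match incoming requests against:

import urllib.request

# The Tor Project publishes the current exit-node IPs at this endpoint.
exit_list = urllib.request.urlopen(
    "https://check.torproject.org/torbulkexitlist").read().decode()
exit_ips = set(exit_list.split())

# Any request arriving from one of these addresses is trivially flagged.
print("1.2.3.4 is a Tor exit:", "1.2.3.4" in exit_ips)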

Tor, Stem, and Sockets - Changing Identity with TOR

I'm trying to run Tor through Python. My goal is to be able to switch exits or otherwise alter my IP from time to time, at a time of my choosing. I've followed several tutorials and encountered several different errors.
This code prints my IP address
import requests
r = requests.get('http://icanhazip.com/')
r.content
# returns my regular IP
I've downloaded and 'installed' the Tor Browser (although it seems to run from an .exe). When it is running, I can use the following code to return my Tor IP, which seems to change every 10 minutes or so, so everything seems to be working so far:
import socks
import socket
socks.set_default_proxy(socks.SOCKS5, "127.0.0.1", 9150)
socket.socket = socks.socksocket
r = requests.get('http://icanhazip.com/')
print r.content
# Returns a different IP
Now, I've found instructions here on how to request a new identity programmatically, but this is where things begin to go wrong. I run the following code:
from stem import Signal
from stem.control import Controller

with Controller.from_port(port=9151) as controller:
    controller.authenticate()
    controller.signal(Signal.NEWNYM)
and I get the error "SOCKS5Error: 0x01: General SOCKS server failure" at the line that creates the controller. I thought this might be a configuration problem - the instructions say that there should be a config file called 'torrc' that contains the port numbers, among other things.
Eventually, in the directory C:\Users\..myname..\Desktop\Tor Browser\Browser\TorBrowser\Data\Tor I found three files: 'torrc', 'torrc-defaults', and 'torrc.orig.1'. My 'torrc-defaults' looks similar to the 'torrc' shown in the documentation, my 'torrc.orig.1' is empty, and my 'torrc' has two lines of comments that look as follows, with some more settings afterwards but no port settings:
# This file was generated by Tor; if you edit it, comments will not be preserved
# The old torrc file was renamed to torrc.orig.1 or similar, and Tor will ignore it
...
I've tried to make sure that the two ports listed in 'torrc-defaults' match the ports in the socks statement at the beginning and the controller statement further on. When these are 9150 and 9151 respectively, I get the general SOCKS server failure listed above.
When I try to run the Controller with the wrong port (in this case 9051, which was recommended in other posts but for me leads to the Tor Browser failing to load when I adjust 'torrc-defaults'), I instead get the error message "[Errno 10061] No connection could be made because the target machine actively refused it".
Finally, when I run the Controller with the Tor Browser running in the background but without first running the SOCKS statement, I instead get a lot of warning text and finally an error, as shown in part below:
...
...
250 __OwningControllerProcess
DEBUG:stem:GETCONF __owningcontrollerprocess (runtime: 0.0010)
TRACE:stem:Sent to tor:
SIGNAL NEWNYM
TRACE:stem:Received from tor:
250 OK
TRACE:stem:Received from tor:
650 SIGNAL NEWNYM
TRACE:stem:Sent to tor:
QUIT
TRACE:stem:Received from tor:
250 closing connection
INFO:stem:Error while receiving a control message (SocketClosed): empty socket content
At first I thought this looked quite promising - I put in a quick check line like this both before and after the new identity request:
print requests.get('http://icanhazip.com/').content
It prints the IP, but unfortunately it's the same both before and afterwards.
I'm pretty lost here - any support would be appreciated!
Thanks.
Get your hashed password:
tor --hash-password my_password
Modify the configuration file:
sudo gedit /etc/tor/torrc
Add these lines:
ControlPort 9051
HashedControlPassword 16:E600ADC1B52C80BB6022A0E999A7734571A451EB6AE50FED489B72E3DF
CookieAuthentication 1
Note: the HashedControlPassword is the output of the first command.
Refer to this link.
I haven't tried it myself because I am using a Mac, but I hope this helps with the configuration.
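With that configuration in place (and Tor restarted), a sketch like this should be able to authenticate and request a new identity - my_password here is the plaintext that produced the hash above:

from stem import Signal
from stem.control import Controller

with Controller.from_port(port=9051) as controller:
    # Authenticate with the plaintext password whose hash is in torrc.
    controller.authenticate(password="my_password")
    controller.signal(Signal.NEWNYM)  # ask Tor to switch to a new circuit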
You can use torify to achieve your goal. For example, run the Python interpreter with the command "sudo torify python" and every connection will be routed through Tor.
Reference on torify: https://trac.torproject.org/projects/tor/wiki/doc/TorifyHOWTO

how to use proxies without the remote site being able to detect the host/host IP?

I'm attempting to use a proxy, via Python, in order to log into a site from a different, specific IP address. It seems that certain websites, however, can detect the original (host) IP address. I've investigated the issue a bit and here's what I found.
There are four proxy methods I've tried:
Firefox with a proxy setting.
Python with mechanize.set_proxies.
Firefox in a virtual machine using an internal network, along with another virtual machine acting as a router (having two adapters: a NAT, and that internal network), set up such that the internal network traffic is routed through a proxy.
TorBrowser (which uses Firefox as the actual browser).
For the first three I used the same proxy. The Tor option was just for additional testing, not via my own proxy. The following behaviors I've noticed are expected:
With all of these, if I go to http://www.whatismyip.com/, it gives the correct IP address (the IP address of the proxy, not the host computer).
whatismyip.com says "No Proxy Detected" for all of these.
Indeed, it seems like the websites I visit do think my IP is that of the proxy. However, there have been a few weird cases which make me think that some sites can somehow detect my original IP address.
In one situation, visiting a non-US site via Firefox with a non-US proxy, the site was literally able to print my originating IP address (from the US) and deny me access. Shouldn't this be impossible? When visiting the site via the virtual machine with that same non-US proxy, or via the TorBrowser with a non-US exit node, though, the site was unable to do so.
In a similar situation, I was visiting another non-US site from a non-US proxy. If I logged into the site from Firefox within the virtual machine, or from the TorBrowser with a non-US exit node, the site would work properly. However, if I attempted to log in via Firefox with a proxy (the same proxy the virtual machine uses), or with mechanize, it would fail to log in with an unrelated error message.
In a third situation, using the mechanize.set_proxies option, I overloaded a site with too many requests so it decided to block access (it would purposefully time out whenever I logged in). I thought it might have blocked the proxy's IP address. However, when I ran the code from a different host machine, but with the same proxy, it worked again, for a short while, until they blocked it again. (No worries, I won't be harassing the site any further - I re-ran the program as I thought it might have been a glitch on my end, not a block from their end.) Visiting that site with the Firefox+proxy solution from one of the blocked hosts also resulted in the purposeful timeout.
It seems to me that all of these sites, in the Firefox + proxy and mechanize cases, were able to find out something about the host machine's IP address, whereas in the TorBrowser and virtual machine cases, they weren't.
How are the sites able to gather this information? What is different about the TorBrowser and virtual machine cases that prevents the sites from gathering this information? And, how would I implement my python script so that the sites I'm visiting via the proxy can't detect the host/host's IP address?
It's possible that the proxy is reporting your real IP address in the X-Forwarded-For HTTP header, although if so, I'm surprised that the WhatIsMyIP site didn't tell you about it.
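You can check this yourself: httpbin.org echoes back the headers it receives, so any X-Forwarded-For or Via header your proxy injects will be visible in the response. A quick sketch (the proxy address below is a hypothetical placeholder):

import requests

# Hypothetical proxy address - substitute the proxy you are testing.
proxies = {"http": "http://your.proxy.example:8080"}

# httpbin.org/headers echoes back the request headers the server received.
r = requests.get("http://httpbin.org/headers", proxies=proxies)
print(r.text)  # look for X-Forwarded-For or Via revealing your real IP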
If you first visited the non-US site directly, and then later again using the proxy, it's also possible that the site might have set cookies in your browser on your first visit that let the site identify you even after your IP address changes. This could account for the differences you've observed between browser instances.
(I've noticed that academic journal sites like to do that. If I try to access a paywalled article from home and get blocked because I wasn't using my university's proxy server, I'll typically have to clear cookies after enabling the proxy to be allowed access.)
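To rule the cookie explanation in or out, start each run with an empty cookie jar. A sketch with requests (mechanize's set_cookiejar serves the same purpose):

import requests

# A fresh Session starts with an empty cookie jar, so nothing stored on
# an earlier visit can link your new IP address to the old one.
session = requests.Session()
r = session.get("http://example.com")
print(session.cookies.get_dict())  # only cookies set during this run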

urllib.urlopen to open page on same port just hangs

I am trying to use urllib.urlopen to open a web page running on the same host and port as the page I am loading it from, and it just hangs.
For example I have a page at: "http://mydevserver.com:8001/readpage.html" and I have the following code in it:
data = urllib.urlopen("http://mydevserver.com:8001/testpage.html")
When I try to load the page it just hangs. However, if I move the testpage.html script to a different port on the same host, it works fine, e.g.
data = urllib.urlopen("http://mydevserver.com:8002/testpage.html")
Does anyone know why this might be and how I can solve the problem?
A firewall, perhaps? Try opening the page from the command line with wget/curl (assuming you're on Linux) or in a browser, trying both ports. You could also use a packet sniffer to find out what's going on and where the connection gets stuck. Also, if testpage.html is dynamically generated, see if it is actually hit: check the web server logs to see whether the request shows up there.
Maybe something is already running on port 8001. Does the page open properly with a browser?
You seem to be implying that you are accessing a web page that is scripted in Python. That implies that the Python script is handling the incoming connections, which could mean that, since it's already handling the urllib call, it is not available to handle the connection that results from it as well.
Show us the code (or tell us what software) you're using to serve these Python scripts.
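To illustrate that failure mode, here is a hypothetical sketch (Python 3 names): a single-threaded HTTPServer that fetches a page from itself hangs, because its only thread is busy with the outer request and can never accept the inner one.

import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/readpage.html":
            # This inner request needs the server to accept a second
            # connection, but its single thread is stuck right here.
            body = urllib.request.urlopen(
                "http://localhost:8001/testpage.html").read()
        else:
            body = b"test page"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# HTTPServer handles one request at a time; a threading server (or a
# second port, as the asker discovered) avoids the self-deadlock.
HTTPServer(("localhost", 8001), Handler).serve_forever()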
