Scrapy with TOR (Windows) - python

I created a Scrapy project with several spiders to crawl some websites. Now I want to use TOR to:
Hide my ip from the crawled servers;
Associate my requests to different ips, simulating accesses from different users.
I have read some info about this, for example:
using tor with scrapy framework, How to connect to https site with Scrapy via Polipo over TOR?
The answers from these links weren't helpful to me. What are the steps that I should take to make Scrapy work properly with TOR?
EDIT 1:
Considering answer 1, I started by installing TOR. As I am using Windows, I downloaded the TOR Expert Bundle (https://www.torproject.org/dist/torbrowser/5.0.1/tor-win32-0.2.6.10.zip) and read the chapter about how to configure TOR as a relay (https://www.torproject.org/docs/tor-doc-windows.html.en). Unfortunately there is little or no information about how to do it on Windows. If I unzip the downloaded archive and run the file Tor\Tor.exe, nothing visible happens. However, I can see in the Task Manager that a new process has started. I don't know what the best way to proceed from here is.

After a lot of research, I found a way to set up my Scrapy project to work with TOR on Windows:
Download the TOR Expert Bundle for Windows (1) and unzip the files to a folder (e.g. \tor-win32-0.2.6.10).
Recent versions of TOR for Windows don't come with a graphical user interface (2). It is probably possible to set up TOR using only config files and cmd commands, but for me the best option was to use Vidalia. Download it (3) and unzip the files to a folder (e.g. vidalia-standalone-0.2.21-win32). Run "Start Vidalia.exe" and go to Settings. On the "General" tab, point Vidalia to TOR (\tor-win32-0.2.6.10\Tor\tor.exe).
On the "Advanced" tab, check the torrc file listed in the "Tor Configuration File" section. I have the following ports configured:
ControlPort 9151
SocksPort 9050
Click "Start Tor" on the Vidalia Control Panel. After some processing you should see the status message "Connected to the Tor network!".
Download the Polipo proxy (4) and unzip the files to a folder (e.g. polipo-1.1.0-win32). Read about using this proxy with TOR at link (5).
Edit the file config.sample and add the following lines to it (at the beginning of the file, for example):
socksParentProxy = "localhost:9050"
socksProxyType = socks5
diskCacheRoot = ""
Start Polipo through cmd. Go to the folder where you unzipped the files and enter the command "polipo.exe -c config.sample".
Now you have Polipo and TOR up and running. Polipo listens for HTTP requests on port 8123 and forwards them to TOR over the SOCKS protocol on port 9050.
Now you can follow the rest of the tutorial "Torifying Scrapy Project On Ubuntu" (6). Continue from the step where the tutorial explains how to test the TOR/Polipo communication.
Links:
(1) https://www.torproject.org/download/download.html.en
(2) https://tor.stackexchange.com/questions/6496/tor-expert-bundle-on-windows-no-installation-instructions
(3) https://people.torproject.org/~erinn/vidalia-standalone-bundles/
(4) http://www.pps.univ-paris-diderot.fr/~jch/software/files/polipo/
(5) http://www.pps.univ-paris-diderot.fr/~jch/software/polipo/tor.html
(6) http://blog.privatenode.in/torifying-scrapy-project-on-ubuntu
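To check the TOR/Polipo chain from Python before touching Scrapy, something like the following might help. It is only a sketch using the standard library: it assumes Polipo is listening on 127.0.0.1:8123 as configured above, and that https://check.torproject.org/api/ip still answers with a small JSON document (that endpoint is my assumption, it is not mentioned in the tutorial). The function names are made up for this example.

```python
import json
import urllib.request

# Assumption: Polipo listens on its default HTTP port 8123, as set up above.
POLIPO_PROXY = "http://127.0.0.1:8123"


def build_proxy_opener(proxy_url=POLIPO_PROXY):
    """Return a urllib opener that sends HTTP and HTTPS traffic through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def check_tor(opener, url="https://check.torproject.org/api/ip", timeout=30):
    """Fetch the Tor check endpoint through the opener and return the parsed JSON."""
    with opener.open(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With TOR and Polipo both running, calling check_tor(build_proxy_opener()) should return the exit address the remote site sees. If the call fails with a connection error, Polipo is probably not listening on port 8123.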

A detailed step-by-step explanation is here:
http://blog.privatenode.in/torifying-scrapy-project-on-ubuntu/
The basic steps there are:
Install Tor and Polipo (on Linux this might require adding a repository).
Configure Polipo to talk to Tor over a SOCKS connection (see the link above).
Create a custom middleware that uses Tor (through Polipo) as an HTTP proxy and randomly changes the Scrapy user agent.
To suppress the deprecation warning from the example above, write
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
instead of 'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
What is your scenario? Have you thought about renting proxy servers?
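For reference, the middleware step could look roughly like this. It is a sketch, not the tutorial's exact code: the class names, the "myproject.middlewares" module path, the priorities and the User-Agent strings are all placeholders, and it assumes Polipo is reachable at 127.0.0.1:8123. Scrapy's built-in HttpProxyMiddleware reads request.meta["proxy"], so setting that key is enough to route a request through Polipo.

```python
import random

# Assumption: Polipo is listening on its default port 8123 on this machine.
HTTP_PROXY = "http://127.0.0.1:8123"

# Placeholder values; use whichever User-Agent strings suit your crawl.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
]


class RandomUserAgentMiddleware:
    """Downloader middleware that sets a random User-Agent on each request."""

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)


class ProxyMiddleware:
    """Downloader middleware that routes each request through Polipo."""

    def process_request(self, request, spider):
        request.meta["proxy"] = HTTP_PROXY


# In settings.py ("myproject.middlewares" is a placeholder module path):
DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    "myproject.middlewares.RandomUserAgentMiddleware": 400,
    "myproject.middlewares.ProxyMiddleware": 410,
}
```

Both process_request methods return None, which tells Scrapy to keep processing the request with the remaining middlewares.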

Related

Using a server to run tasks with Selenium Webdriver

Background
I have built a chrome extension to run tasks automatically with python and selenium on my localhost.
I would like to use my extension on my smartphone (on a different network). For this, I just need to use a specific browser, and it's running well.
The Problem
In order for my extension to work on a different device, I need a server to receive a request passing all the information to start the job.
The API is done, but I don't know how to proceed with the server part.
What I've Tried
I tried to host on Heroku. It's working and I can receive requests, but the web driver isn't working: because it runs headless and the server is located in Europe, the website is blocking my access to the content.
I also tried to use a proxy, but the proxy needs authentication, and that doesn't work with Selenium.
Further Explanation
Basically, I need to either enable my Chrome extension to send a request directly to my personal computer, or use a server with a graphical user interface so that I can set up the proxy manually. I don't have any idea how to proceed with this, or whether it is even the best option.
Any thoughts about a good workaround?

How to configure a tor proxy on windows?

How do I configure a tor proxy on windows?
For example, I want to run the following python script through a tor proxy:
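import requests

# SOCKS support in requests needs PySocks: pip install requests[socks]
proxies = {
    'http': 'socks5h://localhost:9050',
    'https': 'socks5h://localhost:9050',
}
url = 'http://someWebsite.onion'  # requests needs an explicit scheme
res = requests.get(url, proxies=proxies)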
On unix systems, you can simply run tor in terminal, but this doesn't seem to work on windows.
Navigate to \Tor Browser\Browser\TorBrowser\Data\Tor and edit the torrc file:
# ControlPort 9051
SocksPort 9051
Then restart tor.
Use the Tor proxy everywhere:
Control Panel -> Network & Internet -> Internet Options -> Connections -> LAN settings -> tick "Use a proxy server", go to Advanced and add (in the Socks field, since Tor's SocksPort speaks SOCKS rather than HTTP):
proxy 127.0.0.1 port 9051
Use the Tor proxy in a browser like Firefox:
Options -> Network Settings -> tick "Manual proxy configuration" and add (as the SOCKS host):
proxy 127.0.0.1 port 9051
Use with Python requests library:
import requests

# SOCKS support in requests needs PySocks: pip install requests[socks]
proxies = {
    'http': 'socks5://127.0.0.1:9051',
    'https': 'socks5://127.0.0.1:9051',
}
url = 'https://check.torproject.org/'
res = requests.get(url, proxies=proxies)
Note: Tor Browser has to stay running for this to work.
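Since the script only works while Tor is listening, it can be handy to test the SOCKS port before making any request. A small sketch with the standard library; the function name is made up, and the default port 9051 is simply the SocksPort value chosen in the torrc above.

```python
import socket


def is_port_open(host="127.0.0.1", port=9051, timeout=3.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False
```

This only tells you that some process accepts connections on that port, not that it is Tor, so treat a True result as a quick sanity check rather than proof.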
Txtorcon and Stem are libraries developed by the Tor Project for controlling Tor from Python. Stem doesn't have any external dependencies. However, txtorcon allows one to launch Tor from Python, rather than just connect to a running instance.
Both of these libraries require a Tor binary already installed though. It is possible to use the Tor included with the Tor Browser Bundle, connecting on port 9150 (with control port of 9151).
Better yet though, you can download the "Expert Bundle" to get the Tor binary without any browser. The download for it is not currently linked from their new website, but the latest version can still be pulled from https://dist.torproject.org/torbrowser/. Navigate to a directory for either the alpha or stable version and search for "tor-win64-" (or "tor-win32-" if you need 32-bit).
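As an illustration of what Stem can be used for, here is a sketch that asks a running Tor for fresh circuits, which is the usual way to get requests associated with different exit IPs. It assumes Stem is installed (pip install stem), that the control port is 9151 as with the Tor Browser Bundle, and that the control port accepts cookie or password authentication. The function name is made up.

```python
def request_new_identity(control_port=9151, password=None):
    """Ask a running Tor instance to use fresh circuits for new connections."""
    # Imported here so the module still loads when Stem is not installed.
    from stem import Signal
    from stem.control import Controller

    with Controller.from_port(port=control_port) as controller:
        # With password=None, Stem tries the methods Tor advertises,
        # such as cookie authentication.
        controller.authenticate(password=password)
        controller.signal(Signal.NEWNYM)
```

Tor rate-limits this signal, and a new circuit is not guaranteed to use a different exit relay, so it may take more than one call (with a pause in between) before the visible IP changes.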
This works well, but the Tor service needs to be started on Windows first,
or you can set the Tor service up to start automatically at Windows startup.
After installing Tor Browser, the Tor binary is always at this path:
your_installation_path\Tor Browser\Browser\TorBrowser\Tor
The binary is named tor.exe; you should add this folder to the Windows PATH.
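If you would rather start tor.exe from the script itself, a rough sketch follows. It assumes Tor writes its usual "Bootstrapped 100%" notice to standard output once it is connected. The function names are made up, and the path passed to start_tor has to point at your own tor.exe.

```python
import subprocess
import threading


def wait_for_bootstrap(lines):
    """Return True once a Tor log line reports 100% bootstrap progress."""
    for line in lines:
        if "Bootstrapped 100%" in line:
            return True
    return False


def _drain(stream):
    """Keep reading Tor's output so the pipe never fills up."""
    for _ in stream:
        pass


def start_tor(tor_exe):
    """Start the Tor binary and block until it reports a finished bootstrap."""
    process = subprocess.Popen(
        [tor_exe],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    if not wait_for_bootstrap(process.stdout):
        # Output ended without the notice, so Tor exited or failed to connect.
        process.kill()
        process.wait()
        raise RuntimeError("tor stopped before finishing its bootstrap")
    threading.Thread(target=_drain, args=(process.stdout,), daemon=True).start()
    return process
```

The returned process object can be stopped with terminate() when the crawl is finished.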

windows command prompt with automatic proxy URL for internet access

I want to install Python libraries using pip from the Windows command prompt, but I am unable to because no proxy is configured. The internet connection normally requires an 'Automatic proxy configuration URL' in the browser. What should I do for the command prompt? 'set HTTP_PROXY' is not working, as it requires a proxy server IP and port, and in my case there is only an 'Automatic proxy configuration URL'.
Download the file at the automatic proxy configuration URL. It's usually a small JavaScript file. Interpret it manually (alas!) and create a proper proxy setting for pip.
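The manual interpretation can be partly automated. The sketch below downloads the PAC file and lists every "PROXY host:port" entry it mentions. It does not evaluate the JavaScript, so if the file returns different proxies for different hosts you still have to pick the right one yourself. The function names are made up, and the PAC URL is whatever your browser settings show.

```python
import re
import urllib.request

# Matches entries such as: PROXY proxy.example.com:8080
PROXY_RE = re.compile(r"\bPROXY\s+([A-Za-z0-9.\-]+:\d+)")


def download_pac(pac_url, timeout=15):
    """Fetch the PAC file directly, without going through any proxy."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(pac_url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def proxies_from_pac(pac_text):
    """Return the unique host:port values named in PROXY entries, in order."""
    found = []
    for candidate in PROXY_RE.findall(pac_text):
        if candidate not in found:
            found.append(candidate)
    return found
```

Once you know the host and port, pass them to pip with its --proxy option, for example: pip install --proxy http://host:port package_name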

OS X Server: Using wsgi Python on Non-Standard Ports

I'm working with a simple website (a few HTML files and one Python script) that's running on my LAN. In Chrome I can pull up the HTML files and Python scripts through port 80, as normal. I am using WSGIScriptAlias commands in /Library/Server/Web/Config/apache2/httpd_wsgi.conf that are working, and I've set up the site and allowed it to use Python apps through the Server GUI application.
For several reasons, I'm using a different port number for this site. If I go to http://mycomputer.lan:1234/myfile.html, I can see the HTML file. But if I go to http://mycomputer.lan:1234/MyWSGIApplicationScript, the server (the latest version, got it installed today) reports:
Not Found
The requested URL /LandSearch was not found on this server.
I've seen this work before, on other servers and I remember setting it up and getting it working on another system running OS X so the wsgi scripts worked fine on a non-standard port, but I don't have access to the notes and information I had at that time. That makes me suspect it's probably a simple configuration option I need to change for the server to find and use the Python scripts from a different port.
What do I need to reconfigure to get it to use wsgi scripts on a non-standard port?
Even AppleCare didn't have an answer for this one.
When I first set up the site, I enabled the 'Python "Hello World" app at /wsgi'. This is in the advanced settings:
I did that just for testing, so when I set up the site again, I didn't bother with it. It turns out that this one setting does more than enable one wsgi application. It turns out that, by default, the file /Library/Server/Web/Config/apache2/httpd_wsgi.conf is not read by Apache while setting up a virtual host. But checking the box to enable this one wsgi webapp means that the following line:
Include /Library/Server/Web/Config/apache2/httpd_wsgi.conf
will be included in the configuration file for this particular virtual host. Any script aliases defined with the WSGIScriptAlias command in that file will now be available to your website, no matter what port your website is on.

How to open a directory/folder on a machine on LAN using python?

I am designing a website for a local server on our LAN, so that anyone who tries to access that IP from a browser sees a web page, and when they click on a link on that page, a directory or folder from that server should open.
I am using python for this purpose and the server is just like another PC with windows installed.
If you just want to redirect the user to your file server, then it sort of depends on what operating system they're using. If everybody's going to be on Windows, then you should be able to include a link to "//Your-Fileserver-Name/Path1/Path2". Obviously you have to share the appropriate files on your server using Windows file-sharing.
