Is there any alternative to using a proxy in Scrapy? The source site has blocked the server I'm using to run my spiders. I've added a ProxyMiddleware to the project and randomized the proxy, but the proxies are also being blocked by the source site. I've also set DOWNLOAD_DELAY to 5, but the problem persists. Is there any other way to access the site without using proxies, other than moving to a new server?
Using Tor with Polipo solved my problem of being blocked.
Install tor
$ sudo apt-get install tor
Install polipo
$ sudo apt-get install polipo
Configure Polipo to use the Tor SOCKS proxy:
$ sudo nano /etc/polipo/config
Add the following lines at the end of the file:
socksParentProxy = localhost:9050
diskCacheRoot=""
disableLocalInterface=""
Add a proxy middleware in middlewares.py:
class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = 'http://localhost:8123'
        spider.log('Proxy : %s' % request.meta['proxy'])
Activate the ProxyMiddleware in the project settings (settings.py):
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'project_name.middlewares.ProxyMiddleware': 100,
}
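To sanity-check that requests are actually leaving through Tor, a throwaway spider like the sketch below can fetch httpbin.org/ip and log the exit address (the spider name and URL here are just illustrative, not part of the original setup):

import json
import scrapy

class IpCheckSpider(scrapy.Spider):
    # Hypothetical spider used only to verify the Tor/Polipo chain.
    name = "ip_check"
    start_urls = ["http://httpbin.org/ip"]

    def parse(self, response):
        # With the ProxyMiddleware active this should log a Tor exit node's
        # IP rather than the server's own IP.
        self.logger.info("Exit IP: %s", json.loads(response.text)["origin"])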
You may also want to look at Squid.
It can screen out failing proxies, prefer faster ones, rotate proxies automatically, retry and forward requests automatically, and apply your own rules.
Then just point your spider at that single exit proxy.
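For example, a minimal sketch (the host and port are assumptions, matching Squid's default of 3128) that routes every Scrapy request through one local Squid frontend:

# middlewares.py - route all requests through a single Squid frontend
# (localhost:3128 is an assumption; use whatever your Squid listens on).
class SquidProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = 'http://localhost:3128'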
I am using Requests to scrape a website. My scraping code runs on one computer, but I need the requests to come from a different computer (from the perspective of the website being scraped). I understand that I can do this with Requests by passing a proxies= argument when creating my session, and that I have two options: an HTTP proxy or a SOCKS proxy. I understand how to host a SOCKS proxy, because it just works over SSH: I only need to be able to SSH into the proxy machine from the machine running the scraping code and use -D, like this
# Generate key
ssh-keygen -o -a 100 -t ed25519 -C ''
# Copy key to proxy machine
ssh-copy-id -i ~/.ssh/id_ed25519.pub <username>@<ip of the computer acting as a proxy>
# Open a connection to that server on some local port (I randomly chose port 14171)
ssh -D 14171 root@<ip of the computer acting as a proxy>
then I can make requests like this
from requests import Session
proxies = {
    'http': 'socks5://localhost:14171',
    'https': 'socks5://localhost:14171',
}
session = Session()
session.proxies.update(proxies)
session.get('http://example.com')
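(Note that the socks5:// scheme in Requests requires the PySocks package, i.e. the requests[socks] extra.)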
I understand that with an HTTP proxy it's quite similar, I just do
proxies = {
    'http': 'http://user:pass@10.10.1.10:1080',
    'https': 'http://user:pass@10.10.1.10:1080',
}
but what do I use on the server to make it act as an HTTP proxy with a password? And are messages sent in the clear or encrypted?
There are many implementations of HTTP proxies to choose from. Squid seems to be the first result on Google. I also tried Tinyproxy. With Squid, you set it up like this:
Install Squid
apt install squid apache2-utils
Create the password file
sudo touch /etc/squid/squid_passwd
sudo chown proxy /etc/squid/squid_passwd
Then edit the configuration file
mv /etc/squid/squid.conf /etc/squid/squid.conf.default # move default file out of the way
vim /etc/squid/squid.conf
and paste the following as the configuration:
http_port 3128
auth_param basic program /usr/lib/squid/basic_ncsa_auth /etc/squid/squid_passwd
auth_param basic realm proxy
acl authenticated proxy_auth REQUIRED
http_access allow authenticated
You can add those lines to the end of the default file, but the problem is that the default config is roughly 8,000 lines of documentation (with about 25 lines of actual default config), and somewhere in there it forbids all connections (probably all connections not from localhost), which you would have to dig through, and ain't nobody got time for that, so I just cleared the file and used the config above as the whole thing. You should probably take the time to actually learn Squid if you're going to use it, though...
Create a password for a user (youruser is a username; you can choose whatever):
htpasswd /etc/squid/squid_passwd youruser
Restart Squid
service squid restart
Open the port in the firewall
iptables -A INPUT -m state --state NEW -m tcp -p tcp --dport 3128 -j ACCEPT
You can then check that it works with Curl:
curl --proxy <the IP address of your proxy>:3128 --proxy-user youruser:<password> "http://icanhazip.com"
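The same check from Python with Requests looks like this (the IP, user, and password below are placeholders for your own values):

import requests

# Placeholders: substitute your proxy server's IP, the user you created,
# and the password you set with htpasswd.
proxy = "http://youruser:yourpassword@203.0.113.10:3128"
resp = requests.get(
    "http://icanhazip.com",
    proxies={"http": proxy, "https": proxy},
    timeout=10,
)
print(resp.text)  # should print the proxy's public IP, not your own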
Tinyproxy is pretty similar. It has the advantage that you don't have to install a separate package just to set a password, and its default config file is actually short enough to read...
Install Tinyproxy
sudo apt install tinyproxy
Edit the config file
sudo vim /etc/tinyproxy/tinyproxy.conf
These are the options I needed to set:
Change the port to some random port: Port 17724
Comment out the Allow 127.0.0.1 line to allow connections from any IP
Add a line to enable a password: BasicAuth youruser yourpassword
(optional) Disable adding a "Via" header (this is a way of letting the servers you're making requests to know that you're using a proxy): DisableViaHeader yes
(optional) Disable everything except reverse proxying: ReverseOnly Yes
You may want to read through the entire default config file, maybe there are other options you need for your use-case.
Restart the Tinyproxy systemd service
sudo service tinyproxy restart
Open the port in the firewall
sudo iptables -A INPUT -m state --state NEW -m tcp -p tcp --dport 17724 -j ACCEPT
You can then use your proxy with Requests like this (use whichever port you configured: 3128 for the Squid example above, 17724 for Tinyproxy):
proxies = {
    'http': 'http://<youruser>:<password>@<the IP address of your proxy>:3128',
    'https': 'http://<youruser>:<password>@<the IP address of your proxy>:3128',
}
Proxies also allow you to limit connections to only a given IP address, so if the server you're running the code on has a static IP, it's a good idea to accept connections only from that IP. Note that HTTP proxying is not encrypted, so a man-in-the-middle would be able to see your password and then use your proxy.
Sources:
https://www.vultr.com/docs/how-to-install-squid-proxy-on-centos
https://www.vultr.com/docs/install-squid-proxy-on-ubuntu (a bit outdated)
I'm getting the error:
urllib3.exceptions.ProxySchemeUnknown: Proxy URL had no scheme, should start with http:// or https://
but the proxies are fine & so is the URL.
URL = f"https://google.com/search?q={query2}&num=100"
mysite = self.listbox.get(0)
headers = {"user-agent": USER_AGENT}
while True:
proxy = next(proxy_cycle)
print(proxy)
proxies = {"http": proxy, "https": proxy}
print(proxies)
resp = requests.get(URL, proxies=proxies, headers=headers)
if resp.status_code == 200:
break
Print results:
41.139.253.91:8080
{'http': '41.139.253.91:8080', 'https': '41.139.253.91:8080'}
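As the exception says, the proxy strings here have no scheme; one way to address that (assuming these are plain HTTP proxies, which is a guess) is to prefix a scheme before building the dict:

proxy = next(proxy_cycle)
if not proxy.startswith(("http://", "https://")):
    proxy = "http://" + proxy  # give the bare host:port a scheme
proxies = {"http": proxy, "https": proxy}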
On Linux, unset http_proxy and https_proxy in the terminal session from which you run your project:
unset http_proxy
unset https_proxy
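If you'd rather not touch the environment at all, Requests can also be told to ignore those environment proxy variables from Python; a minimal sketch using the standard Session.trust_env flag:

import requests

session = requests.Session()
session.trust_env = False  # ignore http_proxy/https_proxy from the environment
resp = session.get("https://example.com", timeout=10)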
I had the same problem, and setting the https_proxy variable in my terminal really helped me. You can set it as follows (this is the Windows set syntax):
set HTTPS_PROXY=http://username:password@proxy.example.com:8080
set https_proxy=http://username:password@proxy.example.com:8080
Where proxy.example.com is the proxy address (in my case it is "localhost") and 8080 is my port.
You can figure out your username by typing echo %username% in your command line. As for the proxy server, on Windows, you need to go to "Internet Options" -> "Connections" -> LAN Settings and tick "Use a proxy server for your LAN". There, you can find your proxy address and port.
An important note: if you're using PyCharm, try running your script from the terminal first. You may still get this error if you just run the file with the Run button, but launching it from the terminal may make the error go away.
P.S. You can also try downgrading pip to 20.2.3, as that may help too.
I was having the same issue. I resolved it by upgrading the requests library for Python 3:
pip3 install --upgrade requests
I think it is caused by an older version of the requests library conflicting with a newer version of Python 3.
I would like to try to open a page through a proxy using Requests.
https://stackoverflow.com/…/make-requests-using-python-over…
I have this code:
import requests

def get_tor_session():
    session = requests.session()
    # Tor uses the 9050 port as the default socks port
    session.proxies = {'http': 'socks5://127.0.0.1:9050',
                       'https': 'socks5://127.0.0.1:9050'}
    return session

# Make a request through the Tor connection
# IP visible through Tor
session = get_tor_session()
print(session.get("http://httpbin.org/ip").text)
# Above should print an IP different than your public IP

# Following prints your normal public IP
print(requests.get("http://httpbin.org/ip").text)
But I see:
requests.exceptions.InvalidSchema: Missing dependencies for SOCKS support.
What should I do?
Thanks
In order to make Requests use a SOCKS proxy, you need to install it together with its SOCKS dependency:
pip install requests requests[socks]
This error means that Requests is trying to use a SOCKS proxy but SOCKS support is not installed.
Just run pip install pysocks
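A side note: if you also want DNS resolution to go through the proxy (which matters for Tor), use the socks5h:// scheme instead of socks5:// in the proxy URLs.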
I've done a lot of research, and I can't find anything that actually solves my issue.
Since basically no site accepts mitmdump's certificate for HTTPS, I want to ignore those hosts. I can access a specific website with "--ignore-hosts (ip)" as normal, but I need to ignore all HTTPS/SSL hosts.
Is there any way I can do this at all?
Thanks a lot!
There is a script file called tls_passthrough.py in the mitmproxy GitHub repository which ignores hosts that have previously failed a handshake because the user did not trust the presented certificate, although it does not persist this across sessions.
This also means that the first SSL connection from such a host will always fail. What I suggest is to write out all the IPs that have previously failed into a text file and ignore all hosts that appear in that file (a sketch of that idea follows the example below).
tls_passthrough.py
To simply start it, you just add it with the script argument "-s (tls_passthrough.py path)"
Example,
mitmproxy -s tls_passthrough.py
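As a sketch of that ignore-list idea (this is not the actual tls_passthrough.py script; the file name failed_hosts.txt and the addon name are assumptions), an addon could read a plain text file of host names and skip interception for them:

from mitmproxy.proxy.layers import tls

def _load_hosts(path="failed_hosts.txt"):
    # One host name per line; the file name is just an assumption.
    try:
        with open(path) as f:
            return {line.strip() for line in f if line.strip()}
    except FileNotFoundError:
        return set()

class IgnoreListedHosts:
    def __init__(self):
        self.hosts = _load_hosts()

    def tls_clienthello(self, data: tls.ClientHelloData):
        # Pass the connection through untouched if the requested server
        # name (SNI) is on the ignore list.
        if data.client_hello.sni in self.hosts:
            data.ignore_connection = True

addons = [IgnoreListedHosts()]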
You need a simple addon script to ignore all TLS connections:
import mitmproxy

class IgnoreAllTLS:
    def __init__(self) -> None:
        pass

    def tls_clienthello(self, data: mitmproxy.proxy.layers.tls.ClientHelloData):
        '''
        ignore all tls event
        '''
        # LOGC("tls hello from "+str(data.context.server)+" ,ignore_connection="+str(data.ignore_connection))
        data.ignore_connection = True

addons = [
    IgnoreAllTLS()
]
The latest release (7.0.4 at the time of writing) does not support the ignore_connection feature yet, so you need to install the main development version from source:
git clone https://github.com/mitmproxy/mitmproxy.git
cd mitmproxy
python3 -m venv venv
Activate the venv before starting the proxy:
source /path/to/mitmproxy/venv/bin/activate
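(You will probably also need to install the checkout into the venv, for example with pip install -e . from the repository root, before the mitmproxy command is available there; the exact install step is an assumption here.)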
Start up mitmproxy with the addon:
mitmproxy -s ignore_all_tls.py
You can ignore all https/SSL traffic by using a wildcard:
mitmproxy --ignore-hosts '.*'
I'm trying to use Pip behind a proxy server which requires authentication. I've installed cntlm and filled out the hashed passwords. When I run this:
cntlm -c cntlm.ini -I -M http://www.google.co.uk
I enter my password and then get this as a result:
Config profile 1/4... Auth not required (HTTP code: 200)
Config profile 2/4... Auth not required (HTTP code: 200)
Config profile 3/4... Auth not required (HTTP code: 200)
Config profile 4/4... Auth not required (HTTP code: 200)
Your proxy is open, you don't need another proxy.
However, pip doesn't work and still times out. Knowing that I don't need another proxy is all fine and dandy, but pip still can't get through. Port 3128 is working, because I can telnet to it and it shows as listening under netstat. So what should I do from here?
Thank you.
I have had the exact same issue.
Cntlm is used for authenticating to proxy servers; those messages mean that your server does not require authentication.
The pip command does have a --proxy option. Try using something like:
pip install --proxy=10.0.0.1:80 package_name
If this works, you know that you don't need authentication to access the web. If it still fails, try:
pip install --proxy=user:password@10.0.0.1:80 package_name
This gets around the authentication. I have written a small cmd script to handle this on Windows:
@echo off
:: GetPwd.cmd - Get password with no echo.
setlocal
<nul: set /p passwd=
for /f "delims=" %%i in ('python -c "from getpass import getpass; pwd = getpass(); print(pwd)"') do set passwd=%%i
echo.
::Prompt for the package name
set /p package=What package would you like to get:
::Get the package with PIP
pip install --proxy="admin:%passwd%@PROXY_ADDRESS:80" %package%
endlocal
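To use it, run GetPwd.cmd, type the proxy password at the hidden prompt, then enter the package name when asked. PROXY_ADDRESS (and the admin user) in the last pip line are placeholders to replace with your own proxy host and username.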