Does anyone have any ideas on why the urllib2 version returns the webpage, while the Requests version returns a connection error:
[Errno 10060] A connection attempt failed because the connected party
did not properly respond after a period of time, or established
connection failed because connected host has failed to respond.
Urllib2 code (Working):
import urllib2
proxy = urllib2.ProxyHandler({'http': 'http://login:password@proxy1.com:80'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
wPage = urllib2.urlopen('http://www.google.com/')
print wPage.read();
Requests code (Not working - Errno 10060):
import requests
proxy = {"http": "http://login:password#proxy1.com:80"}
wPage = requests.get('http://www.google.com/', proxies=proxy)
print wPage.text
The requests version returns intranet webpages, but gives an error on internet pages.
I am running Python 2.7
* Edit *
Based on m170897017's suggestion, I looked for differences in the GET requests. The only difference was in Connection and Proxy-Connection.
Urllib2 version:
header: Connection: close
header: Proxy-Connection: close
Requests version :
header: Connection: Keep-Alive
header: Proxy-Connection: Keep-Alive
I forced the Requests version to close both of those connections by modifying the headers:
header = {
    "Connection": "close",
    "Proxy-Connection": "close"
}
The GET requests for both now match; however, the Requests version still does not work.
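For reference, the modified headers above would be passed along with the proxies like this (a sketch of the setup described; the exact call isn't shown in the question):
import requests

# Placeholder proxy credentials from the question.
proxy = {"http": "http://login:password@proxy1.com:80"}
header = {
    "Connection": "close",
    "Proxy-Connection": "close"
}

wPage = requests.get('http://www.google.com/', proxies=proxy, headers=header)
print wPage.text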
Try this:
import urllib2
proxy = urllib2.ProxyHandler({'http': '1.1.1.1:9090'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.google.com/')
datum = response.read().decode("UTF-8")
response.close()
print datum
A little late... but for future reference this line:
proxy = {"http": "http://login:password#proxy1.com:80"}
should also have a second key/value pair for https, even if it's not going to be used.
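For example, a minimal sketch reusing the placeholder credentials from the question:
import requests

# Supply both keys; requests picks the entry matching the URL scheme.
proxies = {
    "http": "http://login:password@proxy1.com:80",
    "https": "http://login:password@proxy1.com:80"
}
wPage = requests.get('http://www.google.com/', proxies=proxies)
print wPage.text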
Also, there is an awesome package called proxy-requests that does something very similar:
pip3 install proxy-requests
https://pypi.org/project/proxy-requests/
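A minimal usage sketch, based on my reading of the project's README (the ProxyRequests class and its methods are assumptions taken from that page, so double-check against the current docs):
from proxy_requests import ProxyRequests

# Fetches the page through a proxy sourced by the library.
r = ProxyRequests('https://api.ipify.org')
r.get()
print(r)  # the response body, per the README example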
I am scraping search results from Google using the people_also_ask module. The module itself doesn't have a method to use proxies, but I manually added proxies in the module. When I got blocked by Google, I printed the status and it said my IP address was banned from sending requests. The code I added in the people_also_ask module to use proxies is:
proxies = {
    'http' : "http://username:password@ip:port"
}
response = SESSION.get(URL, params=params, headers=HEADERS, proxies=proxies)
I know it is an illegal activity, but I mainly want to know why this happens, for educational purposes. I think the code that extracts the data is irrelevant, so I am adding simple code to send a request using the people_also_ask module:
import people_also_ask as paa
queries = ["how to boil eggs","how to make cake","price of poco f1","price of wooden table","best soap in us","how much tesla worth"]
for query in queries:
    questions = paa.get_related_questions(query, 40)
Note: The changes are made in the first function, named search(), in google.py of the people_also_ask module.
Note: I can do searches from the browser without any problem. Why does Google allow me to search normally but block the script?
The answer is quite simple. Although it is a proxy service, it doesn't guarantee 100% anonymity. When you send the HTTP GET request via the proxy server, the request sent by your program to the proxy server is:
GET http://www.whatsmybrowser.org/ HTTP/1.1
Host: www.whatsmybrowser.org
Connection: keep-alive
Accept-Encoding: gzip, deflate
Accept: */*
User-Agent: python-requests/2.10.0
Now, when the proxy server sends this request to the actual destination, it sends:
GET http://www.whatsmybrowser.org/ HTTP/1.1
Host: www.whatsmybrowser.org
Accept-Encoding: gzip, deflate
Accept: */*
User-Agent: python-requests/2.10.0
Via: 1.1 naxserver (squid/3.1.8)
X-Forwarded-For: 122.126.64.43
Cache-Control: max-age=18000
Connection: keep-alive
As you can see, it puts your IP (in my case, 122.126.64.43) in the X-Forwarded-For HTTP header, and hence the website knows that the request was sent on behalf of 122.126.64.43.
Read more about this header at: https://www.rfc-editor.org/rfc/rfc7239
If you want to host your own squid proxy server and want to disable setting X-Forwarded-For header, read: http://www.squid-cache.org/Doc/config/forwarded_for/
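A quick way to check what the destination actually sees is to request an echo service through your proxy and inspect the headers it reports back (a sketch; the proxy URL is a placeholder and httpbin.org is just a convenient echo endpoint):
import requests

# Placeholder proxy; substitute your own.
proxies = {
    "http": "http://username:password@ip:port",
    "https": "http://username:password@ip:port"
}

# httpbin echoes the headers it received, so a non-anonymous proxy will
# show up here as X-Forwarded-For and/or Via entries.
r = requests.get("http://httpbin.org/headers", proxies=proxies)
print(r.json()["headers"])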
I don't take credit for this answer; I copied it from the following post I found: Python Requests module - proxy not working
I am trying to scrape a website using requests in python.
url = "https://stackoverflow.com/questions/23013220/max-retries-exceeded-with-url"
# set the headers like we are a browser,
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'}
# download the homepage
s = requests.Session()
s.trust_env = False
response = s.get(url, headers=headers )
This is working fine when I use my personal wifi. However, when I connect to my company's VPN, I get the following error.
ConnectionError: HTTPSConnectionPool(host='stackoverflow.com', port=443): Max retries exceeded with url: /questions/23013220/max-retries-exceeded-with-url (Caused by NewConnectionError(': Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it',))
Now, I need this to work over my company's VPN because I need to access a website which works only in that. How to resolve this?
In my case, the problem was related to IPv6.
Our VPN used split tunneling, and it seems the VPN configuration does not support IPv6.
So for example this would hang forever:
requests.get('https://pokeapi.co/api/v2/pokemon')
But if you add a timeout, the request succeeds:
requests.get('https://pokeapi.co/api/v2/pokemon', timeout=1)
But not all machines were having this problem, so I compared the output of this on two different machines:
import socket
for line in socket.getaddrinfo('pokeapi.co', 443):
    print(line)
The working one only returned IPv4 addresses. The non-working machine returned both IPv4 and IPv6 addresses.
So with the timeout specified, my theory is that Python fails quickly with IPv6 and then moves on to IPv4, where the request succeeds.
Ultimately we resolved this by disabling IPv6 on the machine:
networksetup -setv6off "Wi-Fi"
But I assume that this could instead be resolved through VPN configuration.
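If changing the OS or VPN configuration isn't an option, a process-local workaround in the same spirit (not part of the original fix, just a sketch) is to filter DNS results down to IPv4 before requests resolves the host:
import socket
import requests

# Keep a reference to the real resolver, then wrap it so that only IPv4
# (AF_INET) results are ever returned to requests/urllib3.
_orig_getaddrinfo = socket.getaddrinfo

def _getaddrinfo_ipv4_only(host, port, family=0, type=0, proto=0, flags=0):
    return _orig_getaddrinfo(host, port, socket.AF_INET, type, proto, flags)

socket.getaddrinfo = _getaddrinfo_ipv4_only

print(requests.get('https://pokeapi.co/api/v2/pokemon', timeout=5).status_code)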
How about trying it like this:
url = "https://stackoverflow.com/questions/23013220/max-retries-exceeded-with-url"
ua = UserAgent()
headers = headers = {"User-Agent": ua.random}
# download the homepage
s = requests.Session()
s.trust_env = False
response = s.get(url, headers=headers)
The problem seems to be caused by a difference in the User-Agent settings.
Try setting trust_env = None:
trust_env = None  # Trust environment settings for proxy configuration, default authentication and similar.
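A minimal sketch of that suggestion, setting it on a Session so the environment proxy variables and .netrc auth are ignored for those requests:
import requests

s = requests.Session()
s.trust_env = None  # ignore environment proxy/auth settings (False has the same effect)
response = s.get("https://stackoverflow.com/questions/23013220/max-retries-exceeded-with-url")
print(response.status_code)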
Or you can disable proxies for a particular domain:
import os
os.environ['NO_PROXY'] = 'stackoverflow.com'
In my organization, I have to run my program under a VPN for different geo locations, so we have multiple proxy configurations.
I found it simpler to use a package called PyPAC to get my proxy details automatically:
from pypac import PACSession
from requests.auth import HTTPProxyAuth
session = PACSession()
# when the username and password is required
# session = PACSession(proxy_auth=HTTPProxyAuth(name, password))
r = session.get('http://example.org')
How does this work:
The package locates the PAC file which is configured by the organization. This file consists of the proxy configuration details (more info).
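If auto-discovery doesn't find anything, PyPAC can also be pointed at a PAC file explicitly. A sketch, assuming a hypothetical internal PAC URL (check the get_pac/PACSession arguments against the PyPAC docs):
from pypac import PACSession, get_pac

# Hypothetical PAC location; replace with your organization's.
pac = get_pac(url='http://internal.example.com/proxy.pac')
session = PACSession(pac=pac)
r = session.get('http://example.org')
print(r.status_code)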
I'm using python requests to send http requests to www.fredmeyer.com
I can't even get past an initial GET request to this domain. Doing a simple requests.get results in the connection hanging and never timing out. I've verified I have access to this domain and am able to run the request on my local machine. Can anyone replicate this?
The site seems to have some filtering enabled to prohibit bots or similar. The following HTTP request works currently with the site:
GET / HTTP/1.1
Host: www.fredmeyer.com
Connection: keep-alive
Accept: text/html
Accept-Encoding:
If the Connection header is removed or its value changed to close it will hang. If the (empty) Accept-Encoding header is missing it will also hang. If the Accept line is missing it will return 403 Forbidden.
In order to access this site with requests the following currently works for me:
import requests
headers = { 'Accept':'text/html', 'Accept-Encoding': '', 'User-Agent': None }
resp = requests.get('https://www.fredmeyer.com', headers=headers)
print(resp.text)
Note that the heuristics used by the site to detect bots might change, so this might stop working in the future.
I'm trying to automate form filling on a SharePoint site, but my Python script can't get past the authentication box that pops up when you open the URL below.
from base64 import b64encode
import mechanize
url = 'http://moss.micron.com/MFG/ProbeTest/Lists/Manufacturing%20Requests/AllItems.aspx'
username = 'username'
password = 'password'
# I have had to add a carriage return ('%s:%s\n'), but
# you may not have to.
b64login = b64encode('%s:%s' % (username, password))
br = mechanize.Browser()
br.addheaders.append(
    ('Authorization', 'Basic %s' % b64login)
)
br.open(url)
This results in the following error:
EDIT:
Here are the results of running wget on the requested page.
--2013-08-30 11:16:17-- http://moss.micron.com/MFG/ProbeTest/Lists/Manufacturing%20Requests/AllItems.aspx
Resolving moss.micron.com... 137.201.88.118
Connecting to moss.micron.com|137.201.88.118|:80... connected.
HTTP request sent, awaiting response...
HTTP/1.1 401 Unauthorized
Server: Microsoft-IIS/7.0
WWW-Authenticate: Negotiate
WWW-Authenticate: NTLM
X-Powered-By: ASP.NET
MicrosoftSharePointTeamServices: 12.0.0.6341
Date: Fri, 30 Aug 2013 17:16:17 GMT
Connection: keep-alive
Content-Length: 0
Authorization failed.
Your browser is respecting the robots.txt on that site, which disallows it.
You can set mechanize.Browser to ignore robots.txt, prior to making the request via:
br.set_handle_robots(False)
Alternately, edit your robots.txt to allow that sort of connection.
You can also set a custom User-Agent header in your mechanize.Browser, which allows you to filter for it in robots.txt.
See here for basic info about robots.txt.
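Putting those two suggestions together, a minimal sketch (the User-Agent string is only an example):
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # ignore robots.txt
br.addheaders = [('User-Agent', 'my-sharepoint-script/0.1')]  # example UA you could allow for in robots.txt
response = br.open('http://moss.micron.com/MFG/ProbeTest/Lists/Manufacturing%20Requests/AllItems.aspx')
print(response.read())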
If you can get to the site with a PC, download Fiddler2 which will allow you to see the transactions required when you log in.
Edit: OK, obviously you have a PC.
I was wondering, how do you close a connection with Requests (python-requests.org)?
With httplib it's HTTPConnection.close(), but how do I do the same with Requests?
Code:
r = requests.post("https://stream.twitter.com/1/statuses/filter.json", data={'track':toTrack}, auth=('username', 'passwd'))
for line in r.iter_lines():
    if line:
        self.mongo['db'].tweets.insert(json.loads(line))
I think a more reliable way of closing a connection is to tell the server explicitly to close it, in a way compliant with the HTTP specification:
HTTP/1.1 defines the "close" connection option for the sender to
signal that the connection will be closed after completion of the
response. For example,
Connection: close
in either the request or the response header fields indicates that the
connection SHOULD NOT be considered `persistent' (section 8.1) after
the current request/response is complete.
The Connection: close header is added to the actual request:
r = requests.post(url=url, data=body, headers={'Connection':'close'})
I came to this question looking to solve the "too many open files" error, but I am using requests.session() in my code. A few searches later, I came up with an answer in the Python Requests documentation which suggests using the with block so that the session is closed even if there are unhandled exceptions:
with requests.Session() as s:
    s.get('http://google.com')
If you're not using Session you can actually do the same thing: https://2.python-requests.org/en/master/user/advanced/#session-objects
with requests.get('http://httpbin.org/get', stream=True) as r:
    print(r.status_code)  # do something with the response here
As discussed here, there really isn't such a thing as an HTTP connection and what httplib refers to as the HTTPConnection is really the underlying TCP connection which doesn't really know much about your requests at all. Requests abstracts that away and you won't ever see it.
The newest version of Requests does in fact keep the TCP connection alive after your request. If you do want your TCP connections to close, you can configure requests not to use keep-alive.
s = requests.session()
s.config['keep_alive'] = False  # note: the config dict only exists in old (pre-1.0) versions of Requests
Please use response.close() to close the connection, to avoid the "too many open files" error.
For example:
r = requests.post("https://stream.twitter.com/1/statuses/filter.json", data={'track':toTrack}, auth=('username', 'passwd'))
....
r.close()
On Requests 1.X, the connection is available on the response object:
r = requests.post("https://stream.twitter.com/1/statuses/filter.json",
data={'track': toTrack}, auth=('username', 'passwd'))
r.connection.close()
This works for me:
res = requests.get(<url>, timeout=10).content
requests.session().close()
Based on the latest requests (2.25.1), requests.<method> will close the connection by default:
with sessions.Session() as session:
    return session.request(method=method, url=url, **kwargs)
https://github.com/psf/requests/blob/master/requests/api.py#L60
Thus, if you use the latest version of requests, it seems you don't need to close the connection yourself.
Also, if you need to send multiple requests with the same session, it's better to use requests.Session() instead of opening and closing the connection multiple times.
EX:
with requests.Session() as s:
    r = s.get('https://example.org/1/')
    print(r.text)
    r = s.get('https://example.org/2/')
    print(r.text)
    r = s.get('https://example.org/3/')
    print(r.text)
To remove the "keep-alive" header in requests, I created the request from a Request object and then sent it with a Session:
import requests

headers = {
    'Host' : '1.2.3.4',
    'User-Agent' : 'Test client (x86_64-pc-linux-gnu 7.16.3)',
    'Accept' : '*/*',
    'Accept-Encoding' : 'deflate, gzip',
    'Accept-Language' : 'it_IT'
}
url = "https://stream.twitter.com/1/statuses/filter.json"
#r = requests.get(url, headers = headers) #this triggers keep-alive: True
s = requests.Session()
r = requests.Request('GET', url, headers=headers)
# Sketch of the remaining step: prepare the request and send it through the session.
prepped = r.prepare()
resp = s.send(prepped)