Specifying proxies for http vs https sites when using Python Requests - python

I have a Python script to get a page using Requests. I need to use proxies to access the page. When I access an http page, it goes through the proxy but when I access an https page, it does not go through the proxy (I used logs to check this, as explained below). I have checked with the proxy service provider (proxymesh) and they said that their proxies can be used for https pages as well. Is there anything I need to change in the script when accessing https sites vs http sites?
My code is presented below. At the end of this question, I have included the log files generated for the http and https sites, which show that the proxy is used for http but not for https.
Any ideas will be really helpful.
import logging
import requests

# set up logging
logging.getLogger('').handlers = []
logging.basicConfig(
    filename="mylog_with_proxy.log",  # in my code, the full path is specified
    filemode="w",
    level=logging.DEBUG)
#specify proxies and headers
proxies = {'http': 'http://fr.proxymesh.com:31280', 'https': 'http://fr.proxymesh.com:31280'}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36 Edge/14.14393',}
#the two URLs that I accessed. One is for an http site and the other one is for an https site. These sites are just examples of sites I need to access.
http_url = "http://docs.python-requests.org/en/master/user/quickstart/"
https_url = "https://www.haskell.org/happy/"
#get the page. I executed the script twice - once for http_url and the second time for https_url. Here, it shows http_url
r = requests.get(http_url, headers=headers, proxies=proxies, timeout=5)
r.raise_for_status()
The log files are as shown below:
When accessing the http site (that is, when running the script with http_url):
INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): fr.proxymesh.com
DEBUG:requests.packages.urllib3.connectionpool:"GET http://docs.python-requests.org/en/master/user/quickstart/ HTTP/1.1" 200 None
When accessing the https site (that is, when running the script with https_url):
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): www.haskell.org
DEBUG:requests.packages.urllib3.connectionpool:"GET /happy/ HTTP/1.1" 200 None
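Since the HTTPS log line only names the target host, it does not by itself show whether the request went through the proxy. One way to check is to request an IP-echo endpoint over both schemes with the same proxies dict and compare the reported addresses. This is a minimal sketch, assuming httpbin.org/ip as the echo endpoint (it is not part of the original script):

import requests

proxies = {'http': 'http://fr.proxymesh.com:31280',
           'https': 'http://fr.proxymesh.com:31280'}

# httpbin echoes back the IP address it sees the request coming from
for url in ('http://httpbin.org/ip', 'https://httpbin.org/ip'):
    r = requests.get(url, proxies=proxies, timeout=5)
    print(url, '->', r.json()['origin'])

If both calls report the proxy's address rather than your own, the proxy is being used for https as well.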

Related

Url requests not working while the flask app is hosted

I have a Flask web app that runs a Justdial scraper. In my code, I request multiple pages of the Justdial site, pass them to the bs4 module to extract the data, and write the values to an Excel sheet. I use requests.Session() for the requests.
session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36"})
url=f"{entry}/page-{page_number}"
session.verify = False
r = session.get(url).text
Then this "r" is passed into the bs4 module and the extraction process takes place.
Whenever I run this code on localhost, my program works fine: the data gets extracted and the values are stored in the Excel file. But when I host this as a web app on Heroku and try the same process there, I do not get the desired output; no errors are shown by try/except either, and the Excel file comes out empty.
I tried urllib, requests.get(), and also requests.get(url, verify=False), but the same problem persists.
This warning pops up while I run the program on localhost:
/home/disciple/.local/lib/python3.8/site-packages/urllib3/connectionpool.py:846: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
warnings.warn((
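Since nothing is raised inside try/except, it can help to log what the response actually contains when running on Heroku before it is passed to bs4. This is a minimal diagnostic sketch, assuming a hypothetical fetch_page helper wrapped around the asker's session and URL pattern; the status/length logging is only an illustration of what to inspect:

import requests

def fetch_page(session, entry, page_number):
    # entry and page_number come from the scraper's own loop (hypothetical helper)
    url = f"{entry}/page-{page_number}"
    resp = session.get(url, timeout=10)
    # A block page or redirect on Heroku shows up as an unexpected status
    # code, a different final URL, or a much shorter body than on localhost.
    print(resp.status_code, resp.url, len(resp.text))
    return resp.text

If the logged status code or body length differs between localhost and Heroku, the site is most likely serving different content to Heroku's IP range.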

python requests.get(url) times out but works in browser (chrome); how can I tailor the request headers for a certain host?

I am trying to download a file using the python requests module. My code works for some URLs/hosts, but I've come across one that does not.
Based on other similar questions, it may be related to the User-Agent request header. I have tried to remedy this by adding the Chrome user-agent, but the connection still times out for this particular URL (it does work for others).
I have tested opening the URL in the Chrome browser (which works OK) and inspecting the request headers, but I still can't figure out why my code is failing:
import requests

url = 'http://publicdata.landregistry.gov.uk/market-trend-data/house-price-index-data/Indices-2020-03.csv'
headers = {'Upgrade-Insecure-Requests': '1', 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'}

session = requests.Session()
session.headers.update(headers)
response = session.get(url, stream=True)
# !!! code fails here for this particular url !!!

with open('test.csv', "wb") as fh:
    for x in response.iter_content(chunk_size=1024):
        if x:
            fh.write(x)
Update 2020-08-14
I have figured out what was wrong: in the instances where the code was working, the URLs used the https protocol, while this URL uses http, and my proxy settings were configured only for https, not http. After providing an http proxy to requests, my code worked as written.
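For reference, the fix described in the update amounts to giving requests a proxy entry for the http scheme as well as https. A sketch of that idea, with a placeholder proxy address (the asker's actual proxy is not given):

import requests

url = 'http://publicdata.landregistry.gov.uk/market-trend-data/house-price-index-data/Indices-2020-03.csv'

# Placeholder proxy address -- substitute the proxy you actually use.
# Requests picks the proxy by URL scheme, so an http:// URL needs an 'http' entry.
proxies = {
    'http': 'http://proxy.example.com:8080',
    'https': 'http://proxy.example.com:8080',
}

session = requests.Session()
session.proxies.update(proxies)
response = session.get(url, stream=True, timeout=30)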
The code you posted worked for me; it saved the file (129007 lines). It could be that the host is rate-limiting you, so try again later to see if it works.
# count lines
$ wc -l test.csv
129007 test.csv
# inspect headers
$ head -n 4 test.csv
Date,Region_Name,Area_Code,Index
1968-04-01,Wales,W92000004,2.11932727
1968-04-01,Scotland,S92000003,2.108087275
1968-04-01,Northern Ireland,N92000001,3.300419757
You can disable requests' timeouts by passing timeout=None. Here is the official documentation: https://requests.readthedocs.io/en/master/user/advanced/#timeouts
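For completeness, a small sketch of how a timeout is passed to the session get from the code above, per the linked documentation; the specific values here are only examples:

import requests

url = 'http://publicdata.landregistry.gov.uk/market-trend-data/house-price-index-data/Indices-2020-03.csv'
session = requests.Session()

# Wait at most 10 seconds to connect and 60 seconds for the server to respond...
response = session.get(url, stream=True, timeout=(10, 60))

# ...or pass timeout=None to wait indefinitely, as suggested above.
response = session.get(url, stream=True, timeout=None)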

Requests + Proxy Servers, IP address won't change

I am using the python shell to test requests together with proxy servers.
After reading the documentation (http://docs.python-requests.org/en/master/user/advanced/) and a few Stack Overflow threads, I am doing the following:
import requests
s = requests.session()
proxies = {'http': 'http://90.178.216.202:3128'}
s.proxies.update(proxies)
req = s.get('http://jsonip.com')
After this, if I print req.text, I get this:
u'{"ip":"my current IP (not the proxy server IP I have inserted before)","about":"/about", ......}'
Can you please explain why I'm getting my computer's IP address and not the proxy server's IP address?
Did I go wrong somewhere or am I expecting the wrong thing to happen here?
I am new to requests + proxy servers so I would like to make sure I am understanding this.
UPDATE
I also have this in my code:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.0; WOW64; rv:24.0) Gecko/20100101 Firefox/24.0'}
s.headers.update(headers)
Thanks
Vittorio
The site (http://jsonip.com) upgrades insecure requests, so your plain-HTTP request gets redirected to https://jsonip.com, and requests doesn't send the redirected request through a proxy because you don't have an 'https' entry in your proxies dict.
So all you have to do is add an https proxy to proxies, e.g.:
proxies = {'http': 'http://90.178.216.202:3128', 'https': 'https://90.178.216.202:3128'}
Instead of doing this, pass a user-agent header:
requests.post(url='http://abc.com', headers={'user-agent': 'Mozilla/5.0'})
You need to change your get request so that the proxies are actually used, something like this:
req = s.get('http://jsonip.com', proxies=proxies)
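Putting the two suggestions together, a minimal sketch (the proxy address is the one from the question and may no longer be reachable; the point is having both 'http' and 'https' entries and checking the echoed IP):

import requests

s = requests.session()
s.headers.update({'User-Agent': 'Mozilla/5.0 (Windows NT 6.0; WOW64; rv:24.0) Gecko/20100101 Firefox/24.0'})

# Route both plain-HTTP requests and the HTTPS redirect through the proxy.
s.proxies.update({
    'http': 'http://90.178.216.202:3128',
    'https': 'http://90.178.216.202:3128',
})

req = s.get('http://jsonip.com')
print(req.json()['ip'])  # should now report the proxy's address, not yours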

Using URLFetch in Python GAE to fetch a complete document

I am using urlfetch.fetch in App engine using Python 2.7.
I tried fetching 2 URLs belonging to 2 different domains. For the first one, the result of urlfetch.fetch includes the content added after the page's XHR queries for recommended products are resolved.
However, for the page on the other domain, the XHR queries are not resolved and I mostly just get the plain HTML. The XHR queries for this page are likewise made to fetch recommended products to show, etc.
Here is how I use urlfetch:
fetch_result = urlfetch.fetch(url, deadline=5, validate_certificate=True)
URL 1 (the one where XHR is resolved and the response is complete)
https://www.walmart.com/ip/HP-15-f222wm-ndash-15.6-Laptop-Touchscreen-Windows-10-Home-Intel-Pentium-Quad-Core-Processor-4GB-Memory-500GB-Hard-Drive/53853531
URL 2 (the one where I just get the plain HTML for the most part)
https://www.flipkart.com/oricum-blue-486-loafers/p/itmezfrvwtwsug9w?pid=SHOEHZWJUMMTEYRU
Can someone please advise what I may be missing in regard to this inconsistency?
The server is serving different output based on the user-agent string supplied in the request headers.
By default, urlfetch.fetch will send requests with the user agent header set to something like AppEngine-Google; (+http://code.google.com/appengine; appid: myapp.appspot.com).
A browser will send a user agent header like this: Mozilla/5.0 (X11; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0
If you override the default headers for urlfetch.fetch
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0'}
urlfetch.fetch(url, headers=headers)
you will find that the html that you receive is almost identical to that served to the browser.
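A short sketch of that comparison on the Python 2.7 runtime, using the Flipkart URL from the question; comparing response lengths is just one simple way to see the difference:

from google.appengine.api import urlfetch

url = 'https://www.flipkart.com/oricum-blue-486-loafers/p/itmezfrvwtwsug9w?pid=SHOEHZWJUMMTEYRU'
browser_headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0'}

# Fetch once with the default AppEngine-Google user agent and once with a
# browser user agent, then compare what the server returned.
default_result = urlfetch.fetch(url, deadline=5, validate_certificate=True)
browser_result = urlfetch.fetch(url, headers=browser_headers, deadline=5, validate_certificate=True)

print len(default_result.content), len(browser_result.content)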

How to check Proxy headers to check anonymity?

I'm trying to detect high-anonymity proxies, also called private/elite proxies. On a forum I've read this:
High anonymity Servers don't send HTTP_X_FORWARDED_FOR, HTTP_VIA and
HTTP_PROXY_CONNECTION variables. Host doesn't even know you are using
proxy server and of course it doesn't know your IP address.
A highly anonymous proxy will display the following information:
REMOTE_ADDR = Proxy's IP address
HTTP_VIA = blank
HTTP_X_FORWARDED_FOR = blank
So, how can I check for these headers in Python, in order to discard proxies that are not highly anonymous? I have tried to retrieve the headers for 20-30 proxies using the requests package, and also with urllib, the built-in http.client, and urllib2, but I never saw these headers. So I must be doing something wrong...
This is the code I've used to test with requests:
import requests

proxies = {'http': 'http://176.100.108.214:3128'}
header = {'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.360'}

s = requests.session()
s.proxies = proxies
r = s.get('http://www.python.org', headers=header)

print(r.status_code)
print(r.request.headers)
print(r.headers)
It sounds like the forum post you're referring to is talking about the headers seen by the server on your proxied request, not the headers seen by the client on the proxied response.
Since you're testing with www.python.org as the server, the only way to see the headers it receives would be to have access to their logs. Which you don't.
But there's a simple solution: run your own HTTP server, make requests against that, and then you can see what it receives. (If you're behind a firewall or NAT that the proxy you're testing won't be able to connect to, you may have to get a free hosted server somewhere; if not, you can just run it on your machine.)
If you have no idea how to set up and configure a web server, Python comes with one of its own. Just run this script with Python 3.2+ (on your own machine, or an Amazon EC2 free instance, or whatever):
from http.server import HTTPServer, SimpleHTTPRequestHandler

class HeaderDumper(SimpleHTTPRequestHandler):
    def do_GET(self):
        try:
            return super().do_GET()
        finally:
            print(self.headers)

server = HTTPServer(("", 8123), HeaderDumper)
server.serve_forever()
Then run that script with python3 in the shell.
Then just run your client script, with http://my.host.ip instead of http://www.python.org, and look at what the script dumps to the server's shell.
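If you want the server itself to do the classification, here is a sketch that extends the handler above with the header names from the forum quote (HTTP_VIA, HTTP_X_FORWARDED_FOR and HTTP_PROXY_CONNECTION correspond to the Via, X-Forwarded-For and Proxy-Connection request headers); AnonymityChecker is just an illustrative name, and a real check might look at more headers:

from http.server import HTTPServer, SimpleHTTPRequestHandler

class AnonymityChecker(SimpleHTTPRequestHandler):
    def do_GET(self):
        try:
            return super().do_GET()
        finally:
            # Headers that reveal the client is behind a proxy.
            revealing = [h for h in ('Via', 'X-Forwarded-For', 'Proxy-Connection')
                         if self.headers.get(h)]
            if revealing:
                print('proxy reveals itself via:', revealing)
            else:
                print('no proxy-revealing headers seen (looks highly anonymous)')

server = HTTPServer(("", 8123), AnonymityChecker)
server.serve_forever()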
