I am trying to use proxies for my web scraping project, which I built with HTTPX.
However, when I set up my proxies I still got blocked, so I tried to check whether they actually work/get used. I bought the proxies from a professional seller, so they should work just fine.
I found a website that returns the IP from which I am making the request.
I tried to test the use of the proxies like this:
import httpx
import requests
#Username:PW:Hostname
proxies = {"http://": "http://username:pw.io:hostname"}
#response = requests.get('http://ipinfo.io/json',proxies=proxies)
response = httpx.get('http://ipinfo.io/json',proxies=proxies)
print(response.text)
Neither requests nor httpx works for me; the response always returns my real IP. How do I need to set up my proxies? Keep in mind that I actually want to use HTTPX and only used requests for debugging.
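For reference, a minimal sketch of the proxy URL shape that httpx generally expects: scheme://username:password@host:port. Everything below (username, password, proxy.example.com, port 8080) is a placeholder, not your provider's real values.
import httpx

# Proxy URLs are normally written as scheme://username:password@host:port
# (placeholders below; substitute the credentials from your proxy provider).
proxy_url = "http://username:password@proxy.example.com:8080"

# Route both plain-HTTP and HTTPS traffic through the same proxy.
# (requests uses the keys "http"/"https" without the "://".)
proxies = {"http://": proxy_url, "https://": proxy_url}

# Note: newer httpx versions accept a single proxy= URL here instead of proxies=.
response = httpx.get('http://ipinfo.io/json', proxies=proxies)
print(response.text)  # should now report the proxy's IP rather than your own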
Related
I am trying to make a request to a server using Python Requests and it returns a 403. The page works fine using my browser and using urllib.
The headers are identical. I even tried using an ordered dict to make sure the header ordering is identical, but it still won't work.
Then I looked at the SSL/TLS differences and found that the main difference between the three (my browser, requests, and urllib) is that requests doesn't support TLS session tickets.
url="https://www.howsmyssl.com/a/check"
import requests
req = requests.get(url=url)
print(req.text)
import urllib
req = urllib.request.Request(url)
response = urllib.request.urlopen(req)
print(response.read())
The cipher suite is almost identical across all three. The TLS version is 1.3 across all. But session_ticket_supported is true only for the browser and urllib (both of which work) and is false for requests (which returns 403).
So I assumed that the problem is there.
I dug deeper and learned that requests actually uses urllib3, but I got stuck trying to confirm which SSL adapter it uses and how to configure it.
Any ideas on how to enable TLS session tickets for requests? Or maybe I am looking in the wrong place here?
PS. I am using Python 3.9.13 and the latest versions of all packages.
PPS. curl also supports session tickets on my system and can access the server fine.
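One possible approach, offered as a sketch rather than a confirmed fix: urllib3 builds its default SSL context with ssl.OP_NO_TICKET set, so mounting a transport adapter whose context leaves that flag clear should let the handshake advertise session-ticket support again. The TLSTicketAdapter name below is just illustrative, and whether this alone resolves the 403 is not guaranteed.
import ssl
import requests
from requests.adapters import HTTPAdapter

class TLSTicketAdapter(HTTPAdapter):
    # Transport adapter whose SSL context does not set ssl.OP_NO_TICKET,
    # so the client offers TLS session-ticket support in its handshake.
    def init_poolmanager(self, *args, **kwargs):
        ctx = ssl.create_default_context()
        ctx.options &= ~ssl.OP_NO_TICKET  # make sure the "no tickets" flag is clear
        kwargs["ssl_context"] = ctx
        return super().init_poolmanager(*args, **kwargs)

session = requests.Session()
session.mount("https://", TLSTicketAdapter())
check = session.get("https://www.howsmyssl.com/a/check").json()
print(check["session_ticket_supported"])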
I'm relatively new to Python so excuse any errors or misconceptions I may have. I've done hours and hours of research and have hit a stopping point.
I'm using the Requests library to pull data from a website that requires a login. I was initially successful logging in through a session.post(payload) followed by a session.get; I got a [200] response. But once I tried to view the JSON data that sits beyond the login, I hit a [403] response. Long story short, I can make it work by logging in through a browser, inspecting the web elements to find the current session cookie, and then defining the headers in requests to pass along that exact cookie with session.get.
My question is: is it possible to set/generate/find this cookie through Python after logging in? After logging in and out a few times, I can see that some components of the cookie remain the same but others do not. The website I'm using is Garmin Connect.
Any and all help is appreciated.
If your issue is about logging in, you can use a session object. It stores the cookies it receives and sends them on later requests, so it generally handles the cookies for you. Here is an example:
import requests

s = requests.Session()
# all cookies received will be stored in the session object
s.post('http://www...', data=payload)
s.get('http://www...')
Furthermore, with the requests library, you can get a cookie from a response, like this:
url = 'http://example.com/some/cookie/setting/url'
r = requests.get(url)
print(r.cookies)
But you can also send cookies back to the server on subsequent requests, like this:
url = 'http://httpbin.org/cookies'
cookies = dict(cookies_are='working')
r = requests.get(url, cookies=cookies)
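For the Garmin Connect case specifically, here is a minimal sketch of reading the session's cookies out after logging in, in case you want to inspect or reuse them. The login URL and payload below are placeholders, not Garmin's real endpoint.
import requests

s = requests.Session()
# placeholder endpoint and payload; substitute the site's real login form
s.post('https://example.com/login', data={'username': 'me', 'password': 'secret'})

# the session keeps the cookies for you, but you can also read them out
print(requests.utils.dict_from_cookiejar(s.cookies))

# later requests on the same session send those cookies automatically
r = s.get('https://example.com/protected.json')
print(r.status_code)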
I hope this helps!
Reference: How to use cookies in Python Requests
I am trying to log into a website. When I look at print(g.text) I am not getting back the web page I expect, but instead a Cloudflare page that says 'Checking your browser before accessing'.
import requests
import time
s = requests.Session()
s.get('https://www.off---white.com/en/GB/')
headers = {'Referer': 'https://www.off---white.com/en/GB/login'}
payload = {
    'utf8': '✓',
    'authenticity_token': '',
    'spree_user[email]': 'EMAIL#gmail.com',
    'spree_user[password]': 'PASSWORD',
    'spree_user[remember_me]': '0',
    'commit': 'Login'
}
r = s.post('https://www.off---white.com/en/GB/login', data=payload, headers=headers)
print(r.status_code)
g = s.get('https://www.off---white.com/en/GB/account')
print(g.status_code)
print(g.text)
Why is this occurring when I have set the session?
You might want to try this:
import cloudscraper
scraper = cloudscraper.create_scraper() # returns a CloudScraper instance
# Or: scraper = cloudscraper.CloudScraper() # CloudScraper inherits from requests.Session
print(scraper.get("http://somesite.com").text)  # => "<!DOCTYPE html><html><head>..."
It does not require a Node.js dependency.
All credit goes to the cloudscraper PyPI page.
This is due to the fact that the page uses Cloudflare's anti-bot page (or IUAM).
This check is quite difficult to bypass on your own, since Cloudflare changes its techniques periodically. Currently, they check whether the client supports JavaScript, which can be spoofed.
I would recommend using the cfscrape module for bypassing this. To install it, use pip install cfscrape. You'll also need to install Node.js.
You can pass a requests session into create_scraper() like so:
import requests
import cfscrape

session = requests.Session()
session.headers = ...
scraper = cfscrape.create_scraper(sess=session)
I had the same problem because they implemented Cloudflare on the API; I solved it this way:
import cloudscraper
import json
scraper = cloudscraper.create_scraper()
r = scraper.get("MY API").text
y = json.loads(r)
print(y)
You can scrape any Cloudflare-protected page by using this tool. Node.js is mandatory for the code to work correctly.
Download Node.js from https://nodejs.org/en/
import cfscrape #pip install cfscrape
scraper = cfscrape.create_scraper()
res = scraper.get("https://www.example.com").text
print(res)
curl and hx avoid this problem. But how?
I found that they work with HTTP/2 by default, while the requests library only uses HTTP/1.1.
So, for testing I installed httpx with the h2 Python library (to support HTTP/2 requests), and it works if I do: httpx --http2 'https://some.url'.
So, the solution is to use a library that supports HTTP/2, for example httpx with h2.
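A minimal sketch of the same idea from Python code (the URL is the placeholder from above; HTTP/2 support needs the h2 extra, i.e. pip install 'httpx[http2]'):
import httpx

client = httpx.Client(http2=True)  # requires: pip install 'httpx[http2]'
response = client.get("https://some.url")  # placeholder URL

print(response.http_version)  # expected to show "HTTP/2" if the server negotiates it
print(response.status_code)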
It's not a complete solution, though, since it won't help with Cloudflare's anti-bot ("I'm Under Attack Mode", or IUAM) challenge.
I am using a proxy server to connect to several target servers. Some of the target servers expect http and others expect https. My http requests work swimmingly, but urllib2 ignores the proxy handler on the https requests and sends the requests directly to the target server.
I've tried a number of different things but here is one reasonably concise attempt:
import urllib2
import cookielib

cookie_handler = urllib2.HTTPCookieProcessor(cookielib.LWPCookieJar())
proxies = {'http': 'http://123.456.78.9/',
           'https': 'http://123.45.78.9/'}
proxy_handler = urllib2.ProxyHandler(proxies)
url_opener = urllib2.build_opener(proxy_handler, cookie_handler)
request = urllib2.Request('https://example.com')
response = url_opener.open(request)
I understand that urllib2 has had the ability to send https requests to a proxy server since Python 2.6.3, but I can't seem to get it to work. I'm using 2.7.3.
Thanks for any advice you can offer.
UPDATE: The code above does work. I'm not certain why it wasn't working when I asked this question. Most likely, I had a typo in the https proxy URL.
I'm working on a simple HTML scraper for Hulu in Python 2.6 and am having problems logging in to my account. Here's my code so far:
import urllib
import urllib2
from cookielib import CookieJar
# make cookie and redirect handlers
cookies = CookieJar()
cookie_handler = urllib2.HTTPCookieProcessor(cookies)
redirect_handler = urllib2.HTTPRedirectHandler()
opener = urllib2.build_opener(redirect_handler, cookie_handler)  # make opener w/ handlers

# build the url
login_info = {'username': USER, 'password': PASS}  # USER and PASS are defined
data = urllib.urlencode(login_info)
req = urllib2.Request("http://www.hulu.com/account/authenticate", data)  # make the request
test = opener.open(req)  # open the page
print test.read()  # print html results
The code compiles and runs, but all that prints is:
Login.onError("Please \074a href=\"/support/login_faq#cant_login\"\076enable cookies\074/a\076 and try again.");
I assume there is some error in how I'm handling cookies, but just can't seem to spot it. I've heard Mechanize is a very useful module for this type of program, but as this seems to be the only speed bump left, I was hoping to find my bug.
What you're seeing is an AJAX response. The page is probably using JavaScript to set the cookie, which is defeating your attempt to authenticate.
The error message you are getting back could be misleading. For example, the server might be looking at the User-Agent and seeing that it's not one of the supported browsers, or looking at HTTP_REFERER and expecting it to come from the Hulu domain. My point is that there are too many variables coming in with the request to keep guessing them one by one.
I recommend using an HTTP analyzer tool, e.g. Charles or the one in Firebug, to figure out what (header fields, cookies, parameters) the client sends to the server when you do a Hulu login via a browser. This will give you the exact request that you need to construct in your Python code.
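Once you have captured those values, a rough sketch of attaching them to the urllib2 request, still in Python 2 to match the question; the header values here are placeholders standing in for whatever the analyzer actually shows the browser sending:
import urllib
import urllib2
from cookielib import CookieJar

opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(CookieJar()))
data = urllib.urlencode({'username': 'USER', 'password': 'PASS'})  # placeholders

req = urllib2.Request("http://www.hulu.com/account/authenticate", data)
# substitute the exact header values captured from the real browser login
req.add_header('User-Agent', 'Mozilla/5.0 (copied from the browser)')
req.add_header('Referer', 'http://www.hulu.com/')

print opener.open(req).read()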