POST request failing after migrating from requests to urllib2 - python

I have moved away from using the Python requests library, as it was a bit fiddly to get working on Google App Engine. Instead I'm using urllib2, which has better support. Unfortunately, the POST request that previously worked with the requests library no longer works with urllib2.
With requests, the code was as follows:
values = { 'somekey' : 'somevalue'}
r = requests.post(some_url, data=values)
With urllib2, the code is as follows:
import urllib
import urllib2

values = { 'somekey' : 'somevalue'}
data = urllib.urlencode(values)
req = urllib2.Request(some_url, data)
response = urllib2.urlopen(req)
Unfortunately, the latter raises the following error:
HTTP Error 405: Method Not Allowed
The url I'm posting to has the following form:
some_url = 'http://ec2-11-111-111-1.compute-1.amazonaws.com'
I have read that there is an issue with trailing slashes with urllib2; however, when I add the slash as follows:
some_url = 'http://ec2-11-111-111-1.compute-1.amazonaws.com/'
I still get the same error.
There is no redirection at the destination, so the request shouldn't be transformed into a GET. Even if it were, the URL accepts both GET and POST requests. The EC2 Linux instance is running Django, and I can see that it successfully receives the request and makes use of it, so it doesn't return a 405. For some reason urllib2 is picking up a 405 and raising the exception.
Any ideas as to what may be going wrong?
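One way to narrow this down is to catch the HTTPError and read its body, since urllib2's error object behaves like a response. A sketch of that (assuming the server includes an explanatory message with the 405):

import urllib
import urllib2

values = {'somekey': 'somevalue'}
data = urllib.urlencode(values)
req = urllib2.Request(some_url, data)
try:
    response = urllib2.urlopen(req)
    print(response.read())
except urllib2.HTTPError as e:
    # HTTPError is file-like: the status code, headers and body the
    # server returned are all available for inspection here.
    print(e.code)
    print(e.headers)
    print(e.read())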
Edit 1:
As per @philip-tzou's good call, the following information might help:
print req.get_method()
yields POST
print req.header_items()
yields []
Edit 2
Adding a user agent header (as @padraic-cunningham suggests) didn't solve it, unfortunately. I added the same header shown in the urllib2 example:
user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
headers = {'User-Agent': user_agent}
data = urllib.urlencode(values)
req = urllib2.Request(some_url, data, headers)
Edit 3
As @furas suggested, I've sent the request I'm making to requestb.in to double-check what's being sent. It is indeed a POST request being made, with the following headers:
Connection: close
Total-Route-Time: 0
X-Cloud-Trace-Context: ed073df03ccd05657<removed>2a203/<removed>129639701404;o=5
Connect-Time: 1
Cf-Ipcountry: US
Cf-Ray: <removed>1d155ac-ORD
Cf-Visitor: {"scheme":"http"}
Content-Length: 852
Content-Type: application/x-www-form-urlencoded
Via: 1.1 vegur
X-Request-Id: 20092050-5df4-42f8-8fe0-<removed>
Accept-Encoding: gzip
Cf-Connecting-Ip: <removed>
Host: requestb.in
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppEngine-Google; (+http://code.google.com/appengine; appid: s~<removed>)
and now req.header_items() yields [('User-agent', 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)')]
By way of comparison, the headers from the requests POST are:
Cf-Connecting-Ip: <removed>
Accept-Encoding: gzip
Host: requestb.in
Cf-Ipcountry: US
Total-Route-Time: 0
Cf-Ray: 2f42a45185<removed>-ORD
Connect-Time: 1
Connection: close
Cf-Visitor: {"scheme":"http"}
Content-Type: application/x-www-form-urlencoded
Content-Length: 852
X-Cloud-Trace-Context: 8932f0f9b165d9a0f698<removed>/1740038835186<removed>;o=5
Accept: */*
X-Request-Id: c88a29e1-660e-4961-8112-<removed>
Via: 1.1 vegur
User-Agent: python-requests/2.11.1 AppEngine-Google; (+http://code.google.com/appengine; appid: s~<removed>)
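Comparing the two header dumps, the only request header present with requests but missing with urllib2 is Accept: */*. A sketch of adding it explicitly, on the assumption that something in front of the Django app rejects requests without it (this is a guess, not a confirmed cause):

import urllib
import urllib2

values = {'somekey': 'somevalue'}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)',
    'Accept': '*/*',
}
data = urllib.urlencode(values)
req = urllib2.Request(some_url, data, headers)
response = urllib2.urlopen(req)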

Related

Crawler got identified when headers provided and request interval set

Steam provides an endpoint which shows market information for a specific item. Its parameters include country, currency, language, item_nameid, and two_factor.
For example,
https://steamcommunity.com/market/itemordershistogram?country=TW&language=english&currency=30&item_nameid=176185978&two_factor=0
means my country, currency, and language are TW, 30 (TWD), and english. The ID of the item I'm interested in is 176185978.
Normally, you can continually access this endpoint without any problem after logging in. I tried accessing this endpoint thirty times within one minute and no problem occurred.
But when using the requests library to access this endpoint, Steam quickly identifies the crawling, even with the user-agent, cookies, and other headers provided and a request interval set. I got IP banned after 8 to 12 requests.
My code:
import random
import time

from requests import Session

url = 'https://steamcommunity.com/market/itemordershistogram'
payload = {'country': 'TW', 'language': 'english', 'currency': 30, 'item_nameid': 176185978, 'two_factor': 0}
# headers copied from network tab
temp = '''Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Accept-Encoding: gzip, deflate, br
Accept-Language: zh-TW,zh;q=0.9,en-US;q=0.8,en;q=0.7
Cache-Control: no-cache
Connection: keep-alive
Host: steamcommunity.com
Pragma: no-cache
sec-ch-ua: "Google Chrome";v="107", "Chromium";v="107", "Not=A?Brand";v="24"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "Windows"
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: none
Sec-Fetch-User: ?1
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'''
headers = {key: value for key, value in [line.split(': ') for line in temp.split('\n')]}  # convert the copied header string into a dict
temp = 'cookies copied from network tab'
cookies = {key: value for key, value in [pair.split('=') for pair in temp.split('; ')]}  # convert the copied cookie string into a dict
session = Session()
for i in range(100):
    print(f'{i}:')
    res = session.get(url, headers=headers, params=payload, cookies=cookies)
    print(res.status_code)
    print(res.text)
    sleep_time = 5 + random.random()
    time.sleep(sleep_time)
I did some research on how to deal with anti-crawling measures; most articles say that setting the user-agent and cookies or extending the request interval would help. But in my case, Steam is still able to identify the crawling.
Some articles say adding a Referer header would help, but I see no Referer header in the browser's network tab.
Therefore, I assume adding a Referer header wouldn't help.
So I'm wondering: besides the user-agent, cookies, other headers and the request interval, how can Steam know I'm using a crawler?
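For what it's worth, the headers that requests actually sent can be checked from the response object itself. A small sketch (res here is one Response from the loop above), useful for confirming nothing unexpected was added or dropped on the client side:

# Inspect the request exactly as requests prepared and sent it.
prepared = res.request
print(prepared.method, prepared.url)
for name, value in prepared.headers.items():
    print(f'{name}: {value}')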

Using Python and requests module to post

There are similar questions posted, but I still seem to have a problem. I am expecting to receive a registration email after running this, but I receive nothing. Two questions: what is wrong? And how would I even know whether the data was successfully submitted, as opposed to the page just loading normally?
import requests

serviceurl = 'https://signup.com/'
payload = {'register-fname': 'Peter', 'register-lname': 'Parker', 'register-email': 'xyz@email.com', 'register-password': '9dlD313kF'}
r2 = requests.post(serviceurl, data=payload)
print(r2.status_code)
The url for the POST request is actually https://signup.com/api/users, and it returns 200 (in my browser).
You need to replicate what your browser does. This might include certain request headers.
You will want to use your browser's dev tools/network inspector to gather this information.
The information below is from Firefox on my computer:
Request headers:
Host: signup.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0
Accept: application/json, text/plain, */*
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate, br
Content-Type: application/json;charset=utf-8
Content-Length: 107
Origin: https://signup.com
Connection: keep-alive
Referer: https://signup.com/
Cookie: _vspot_session_id=ce1937cf52382239112bd4b98e0f1bce; G_ENABLED_IDPS=google; _ga=GA1.2.712393353.1584425227; _gid=GA1.2.1095477818.1584425227; __utma=160565439.712393353.1584425227.1584425227.1584425227.1; __utmb=160565439.2.10.1584425227; __utmc=160565439; __utmz=160565439.1584425227.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __utmt=1; __qca=P0-1580853344-1584425227133; _gat=1
Pragma: no-cache
Cache-Control: no-cache
Payload:
{"status":true,"code":null,"email":"TestEmail#hotmail.com","user":{"id":20540206,"email":"TestEmail#hotmail.com","name":"TestName TestSurname","hashedpassword":"4ffdbb1c33d14ed2bd02164755c43b4ad8098be2","salt":"700264767700800.7531319164902858","accesskey":"68dd25c3ae0290be69c0b59877636a5bc5190078","isregistered":true,"activationkey":"f1a6732b237379a8a1e6c5d14e58cf4958bf2cea","isactivated":false,"chgpwd":false,"timezone":"","phonenumber":"","zipcode":"","gender":"N","age":-1,"isdeferred":false,"wasdeferred":false,"deferreddate":null,"registerdate":"2020/03/17 06:09:27 +0000","activationdate":null,"addeddate":"2020/03/17 06:09:27 +0000","admin":false,"democount":0,"demodate":null,"invitationsrequest":null,"isvalid":true,"timesinvalidated":0,"invaliddate":null,"subscribe":0,"premium":false,"contributiondate":null,"contributionamount":0,"premiumenddate":null,"promo":"","register_token":"","premiumstartdate":null,"premiumsubscrlength":0,"initial_reg_type":"","retailmenot":null,"sees":null,"created_at":"2020/03/17 06:09:27 +0000","updated_at":"2020/03/17 06:09:27 +0000","first_name":"TestName","last_name":"TestSurname"},"first_name":"TestName","last_name":"TestSurname","mobile_redirect":false}
There's a lot to replicate. Things like the hashed password, salt, dates, etc would have been generated by JavaScript executed by your browser.
Keep in mind, the website owner might not appreciate a bot creating user accounts.
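A rough sketch of the direction described above, i.e. posting JSON to the real endpoint with browser-like headers; the field names in the payload below are placeholders and would have to be taken from the actual request your browser sends:

import requests

api_url = 'https://signup.com/api/users'  # the endpoint the browser actually posts to
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0',
    'Accept': 'application/json, text/plain, */*',
    'Origin': 'https://signup.com',
    'Referer': 'https://signup.com/',
}
# Placeholder body: the real payload (shown above) is built by JavaScript in
# the browser and includes hashed/salted values that requests cannot reproduce.
payload = {'email': 'xyz@email.com', 'first_name': 'Peter', 'last_name': 'Parker'}
r = requests.post(api_url, json=payload, headers=headers)
print(r.status_code, r.text)

Passing json= makes requests serialize the body and set the Content-Type header itself, which mirrors the application/json request the browser makes.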

Requests: Python Post Request only returns error code 400

(I'm new to posting on Stack Overflow, so tell me stuff to improve on, ty.)
I recently got into Python requests and I am a bit new to web stuff related to Python, so if it's something stupid, don't flame me. I've been trying a lot of ways to get one of my POST requests to not return error 400. I messed with headers, data and a bunch of other stuff, and I would get either code 404 or 400.
I have tried adding a bunch of headers to my project. I'm trying to log in to the site "Roblox" to make an auto-purchase bot. The CSRF token was not in the cookies, so I found a loophole: getting the token by sending a request to a logout URL (in case anyone gets confused). I double-checked the data I inputted as well and it is indeed correct. I researched for multiple hours and couldn't find a way to fix it, so I came to Stack Overflow for the first time.
import requests

RobloxSignInDeatils = {
    "cvalue": "TestWebDevAcc",  # Username
    "ctype": "Username",  # Leave alone
    "password": "Test123123"  # Password
}

def GetToken(Session, Headers):
    LogoutUrl = "https://api.roblox.com/sign-out/v1"  # URL to get CSRF
    Request = Session.post(LogoutUrl, headers=Headers)  # Send POST
    print("Token: ", Request.headers["X-CSRF-TOKEN"])  # Steal the token
    return Request.headers["X-CSRF-TOKEN"]  # Return token

with requests.session() as CurrentSession:
    Header = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.96 Safari/537.36"
    }  # Get the User-Agent header only
    CurrentSession.get("https://www.roblox.com/login", headers=Header)  # Go to the login page
    CsrfToken = GetToken(CurrentSession, Header)  # Get the token
    Header["X-CSRF-TOKEN"] = CsrfToken  # Set the token
    SignIn = CurrentSession.post("https://auth.roblox.com/v2/login",
                                 data=RobloxSignInDeatils, headers=Header)  # Send POST to sign in
    print(SignIn.status_code)  # Returns 400 ?
Printing "print(SignIn.status_code)" just returns 400 nothing else to explain
EDIT: If this helps heres a list of ALL the headers:
Request URL: https://auth.roblox.com/v2/login
Request Method: OPTIONS
Status Code: 200 OK
Remote Address: 128.116.112.44:443
Referrer Policy: no-referrer-when-downgrade
Access-Control-Allow-Credentials: true
Access-Control-Allow-Headers: X-CSRF-TOKEN,Content-Type,Pragma,Cache-Control,Expires
Access-Control-Allow-Methods: OPTIONS, TRACE, GET, HEAD, POST, DELETE, PATCH
Access-Control-Allow-Origin: https://www.roblox.com
Access-Control-Max-Age: 600
Cache-Control: max-age=120, private
Content-Length: 0
Date: Thu, 14 Feb 2019 04:04:13 GMT
P3P: CP="CAO DSP COR CURa ADMa DEVa OUR IND PHY ONL UNI COM NAV INT DEM PRE"
Roblox-Machine-Id: CHI1-WEB2875
Accept: */*
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Access-Control-Request-Headers: content-type,x-csrf-token
Access-Control-Request-Method: POST
Connection: keep-alive
Host: auth.roblox.com
Origin: https://www.roblox.com
Referer: https://www.roblox.com/
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.96 Safari/537.36
Payload:
{cvalue: "TestWebDevAcc", ctype: "Username", password: "Test123123"}

What is the suggested way to build and send a http request in Python

I wrote a small module about a year ago.
When I tried to add some features to it these days, I found there has been a big change in Python's urllib, and I was confused about how to build a request.
In the old module FancyURLopener was used as my base class, yet I found it has been deprecated since version 3.3.
So I read the documentation again and tried to build a request instead of an opener.
However, when I tried to add headers, only one function, Request.add_header(key, val), was provided. I have headers copied from Fiddler like this:
GET some_url HTTP/1.1
Host: sosu.qidian.com
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:33.0) Gecko/20100101 Firefox/33.0
Accept: application/json, text/javascript, */*; q=0.01
Accept-Language: zh-cn,en-us;q=0.7,en;q=0.3
Accept-Encoding: gzip, deflate
X-Requested-With: XMLHttpRequest
Referer: anothe_url
Cookie: a lot of data
Connection: keep-alive
So do I have to add them to the request one by one?
Also, I found another opener, urllib.request.build_opener(), which can add a lot of headers at once, but I could not set the method to 'GET' or 'POST'.
I am a newbie in Python; any suggestions?
By far the best way to make http requests in Python is to install and use this module:
http://docs.python-requests.org/en/latest/
You can write code like this:
import requests

headers = {
    'X-Requested-With': 'XMLHttpRequest',
    'Referer': 'anothe_url',
    # etc
}
response = requests.get(some_url, headers=headers)
http://docs.python-requests.org/en/latest/user/quickstart/#custom-headers
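That said, since the question was specifically about the standard library: urllib.request.Request also accepts a whole headers dict in its constructor, so the headers don't have to be added one by one. A quick sketch (Python 3.3+, where the method argument is available; the URL is a placeholder):

import urllib.request

some_url = 'http://example.com/search'  # placeholder for the real URL
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:33.0) Gecko/20100101 Firefox/33.0',
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'X-Requested-With': 'XMLHttpRequest',
    'Referer': 'anothe_url',
}
# With no data argument this is a GET; passing data would make it a POST,
# or the method can be set explicitly as below.
req = urllib.request.Request(some_url, headers=headers, method='GET')
with urllib.request.urlopen(req) as response:
    body = response.read()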

Problems with Twitter REST API 1.1 App-only Auth - 403 Error response with Python

I'm trying to connect to the Twitter API through a POST request as the docs say, but I always get a 403 Forbidden error.
This is my code. I'm using urllib2 in Python 2.7:
import urllib
import urllib2

def auth_API():
    url = 'https://api.twitter.com/oauth2/token'
    header = {}
    values = {}
    header['User-Agent'] = 'Mozilla/6.0 (Windows NT 6.2; WOW64; rv:16.0.1) Gecko/20121011 Firefox/16.0.1'
    header['Authorization'] = 'Basic ' + B64BEARERTOKENCREDS  # base64-encoded key:secret, defined elsewhere
    header['Content-Type'] = 'application/x-www-form-urlencoded;charset=UTF-8'
    header['Accept-Encoding'] = 'gzip'
    values['grant_type'] = 'client_credentials'
    data = urllib.urlencode(values)
    req = urllib2.Request(url, data, header)
    try:
        response = urllib2.urlopen(req)
        response.read()
    except urllib2.HTTPError as e:
        print e
Checking the docs, I found an example request which is the same as mine.
Twitter example:
POST /oauth2/token HTTP/1.1
Host: api.twitter.com
User-Agent: My Twitter App v1.0.23
Authorization: Basic NnB1[...]9JM29jYTNFOA==
Content-Type: application/x-www-form-urlencoded;charset=UTF-8
Content-Length: 29
Accept-Encoding: gzip
grant_type=client_credentials
My request:
POST /oauth2/token HTTP/1.1
Content-Length: 29
Accept-Encoding: gzip
Connection: close
User-Agent: Mozilla/6.0 (Windows NT 6.2; WOW64; rv:16.0.1) Gecko/20121011 Firefox/16.0.1
Host: api.twitter.com
Content-Type: application/x-www-form-urlencoded;charset=UTF-8
Authorization: Basic NnB1[...]YTNFOA==
grant_type=client_credentials
Any idea what could be wrong with this?
Regards.
PS: I know that there are some third-party libs for this, but I want to do it myself.
I solved my problem: the error was with base64.encodestring(), which adds a \n at the end of the string, messing up the request.
Using base64.b64encode() instead worked fine.
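To illustrate the difference with placeholder credentials (a quick sketch; encodestring also wraps long output across multiple lines, while b64encode does not):

import base64

credentials = 'consumer_key:consumer_secret'  # placeholder, not real keys
print(repr(base64.encodestring(credentials)))  # ends with '\n', which corrupts the Authorization header
print(repr(base64.b64encode(credentials)))     # no trailing newline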
Regards
