Chunked encoded messages in Python Requests library

I am currently developing a program in Python 3 that shall send a chunked-encoded message in a POST request to an arbitrary URL. As described in the documentation of the Requests library (see "chunked-encoded requests" under https://requests.readthedocs.io/en/master/user/advanced/), one can use a generator to specify a chunked-encoded message. At the moment, I cannot verify whether my generator gets the job done, because every time I try to print out the message body of the request with
print(response.request.body)
I get the following:
<generator object generator at 0x03FEB530>
For example, I want to send the following message in a request (with Content-Type: application/x-www-form-urlencoded):
x=FOO
In a chunked encoded message, the message body shall look like this:
5\r\n
x=FOO\r\n
0\r\n
\r\n
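For reference, the framing is mechanical: each chunk is its size in hex, a CRLF, the payload, and a CRLF, terminated by a zero-length chunk. A minimal sketch of building that framing by hand (not part of the original question's code):
def chunk(payload):
    # Frame one chunk: hex-encoded size, CRLF, payload, CRLF.
    return ('%x\r\n' % len(payload)).encode() + payload + b'\r\n'

body = chunk(b'x=FOO') + b'0\r\n\r\n'  # '0\r\n\r\n' is the terminating zero-length chunk
print(body)  # b'5\r\nx=FOO\r\n0\r\n\r\n'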
In a whole request, it should look like this:
POST / HTTP/1.1
Host: example.com
Transfer-Encoding: chunked
5
x=FOO
0
I wrote my own generator and wanted to ask whether this is the correct way of doing it. You can see it in the code snippet below.
import requests

def generator():
    var1 = "5\r\n"
    var2 = "x=FOO\r\n"
    var3 = "0\r\n\r\n"
    x = var1.encode('utf8')
    y = var2.encode('utf8')
    z = var3.encode('utf8')
    yield x
    yield y
    yield z

url = 'https://example.com'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0',
           'Connection': 'keep-alive', 'Content-Type': 'application/x-www-form-urlencoded',
           'Transfer-Encoding': 'chunked'}
response = requests.post(url=url, data=generator(), headers=headers)
Disclaimer: I do not want my generator to automatically generate the chunked encoding of a message; I want to send a request with a specific chunked-encoded message like the one shown above.
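One quick way to check what the generator yields is to exhaust a fresh instance and join the chunks. This is only a sketch for previewing the generator's output: Requests may still apply its own chunk framing when it streams a generator body, so this is not necessarily the exact byte sequence on the wire.
preview = b"".join(generator())  # consume a fresh generator and join its chunks
print(preview)  # b'5\r\nx=FOO\r\n0\r\n\r\n'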

Related

Crawler got identified even when headers provided and request interval set

Steam provides an endpoint which shows market information for a specific item. Parameters include country, currency, language, item_nameid, and two_factor.
For example,
https://steamcommunity.com/market/itemordershistogram?country=TW&language=english&currency=30&item_nameid=176185978&two_factor=0
means my country, currency, and language are TW, 30 (TWD), and English. The ID of the item I'm interested in is 176185978.
Normally, you can continually access this endpoint without any problem after logging in. I tried accessing this endpoint thirty times within one minute; no problem occurred.
But when using the requests library to access this endpoint, Steam quickly identified the crawling, even with user-agent, cookies, and other headers provided and a request interval set. I got IP-banned after 8 to 12 requests.
My code:
import random
import time
from requests import Session

url = 'https://steamcommunity.com/market/itemordershistogram'
payload = {'country': 'TW', 'language': 'english', 'currency': 30, 'item_nameid': 176185978, 'two_factor': 0}
# headers copied from network tab
temp = '''Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Accept-Encoding: gzip, deflate, br
Accept-Language: zh-TW,zh;q=0.9,en-US;q=0.8,en;q=0.7
Cache-Control: no-cache
Connection: keep-alive
Host: steamcommunity.com
Pragma: no-cache
sec-ch-ua: "Google Chrome";v="107", "Chromium";v="107", "Not=A?Brand";v="24"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "Windows"
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: none
Sec-Fetch-User: ?1
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'''
headers = {key: value for key, value in [line.split(': ') for line in temp.split('\n')]}  # convert headers from string literal to dict
temp = 'cookies copied from network tab'
cookies = {key: value for key, value in [pair.split('=') for pair in temp.split('; ')]}  # convert cookies from string literal to dict
session = Session()
for i in range(100):
    print(f'{i}:')
    res = session.get(url, headers=headers, params=payload, cookies=cookies)
    print(res.status_code)
    print(res.text)
    sleep_time = 5 + random.random()
    time.sleep(sleep_time)
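A small robustness note on the header-parsing comprehension above: splitting with a maxsplit of 1 protects against header values that themselves contain the separator, e.g.:
headers = {key: value for key, value in
           [line.split(': ', 1) for line in temp.split('\n')]}  # maxsplit=1, in case a value itself contains ': '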
I did some research on how to deal with anti-crawling measures; most articles say that setting the user-agent and cookies, or extending the request interval, would help. But in my case, Steam was still able to identify the crawling.
Some articles say adding a Referer header would help, but I see no Referer header in the browser's network tab.
Therefore, I assume adding a Referer header wouldn't help.
So I'm wondering: besides the user-agent, cookies, other headers, and the request interval, how can Steam know I'm using a crawler?

How to post a form request in Python

I am trying to fill in a form like this and submit it automatically. To do that, I sniffed the packets while logging in.
POST /?pg=ogrgiris HTTP/1.1
Host: xxx.xxx.com
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us
Accept-Encoding: gzip, deflate
Content-Type: application/x-www-form-urlencoded
Origin: http://xxx.xxx.com
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Safari/605.1.15
Referer: http://xxx.xxx.com/?pg=ogrgiris
Upgrade-Insecure-Requests: 1
DNT: 1
Content-Length: 60
Connection: close
seviye=700&ilkodu=34&kurumkodu=317381&ogrencino=40&isim=ahm
I replayed that packet with Burp Suite and saw that it works properly; the response was the HTML of the member page.
Now I am trying to do the same in Python. The code is below:
import requests

url = 'http://xxx.xxx.com/?pg=ogrgiris'
headers = {'Host': 'xxx.xxx.com',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Encoding': 'gzip, deflate',
           'Content-Type': 'application/x-www-form-urlencoded',
           'Referer': 'http://xxx.xxx.com/?pg=ogrgiris',
           'Content-Lenght': '60', 'Connection': 'close'}
credentials = {'seviye': '700', 'ilkodu': '34', 'kurumkodu': '317381', 'ogrecino': '40', 'isim': 'ahm'}
r = requests.post(url, headers=headers, data=credentials)
print(r.content)
The problem is that this code prints the HTML of the login page, even though I send all of the credentials needed to log in. How can I get the member page? Thanks.
If the POST request displays a page with the content you want, then the problem may only be that you are sending the data as JSON instead of in form data format (application/x-www-form-urlencoded).
If a session is created by the first request and you have to make another request for the data you want, then you have to deal with cookies.
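If cookies are the issue, a requests.Session keeps them across requests. A minimal sketch, assuming the login URL from the question (the member-page URL here is hypothetical):
import requests

login_url = 'http://xxx.xxx.com/?pg=ogrgiris'
member_url = 'http://xxx.xxx.com/?pg=uye'  # hypothetical member-page URL

credentials = {'seviye': '700', 'ilkodu': '34', 'kurumkodu': '317381',
               'ogrencino': '40', 'isim': 'ahm'}

with requests.Session() as s:
    s.post(login_url, data=credentials)  # any cookies set by the server are stored on the session
    r = s.get(member_url)                # and sent automatically on follow-up requests
    print(r.text)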
Problem with data format:
r = requests.post(url, headers=headers, data=credentials)
The kwarg json= creates a request body as follows:
{"ogrecino": "40", "ilkodu": "34", "isim": "ahm", "kurumkodu": "317381", "seviye": "700"}
while data= creates a request body like this:
seviye=700&ilkodu=34&kurumkodu=317381&ogrencino=40&isim=ahm
You can try it against https://httpbin.org:
from requests import post

msg = {"a": 1, "b": True}
print(post("https://httpbin.org/post", data=msg).json())  # sent as form data; look at key `form` - it's an object because it was parsed as form data
print(post("https://httpbin.org/post", json=msg).json())  # sent as JSON; look at key `data` - it's a string
If your goal is to replicate the sample request, you are missing a lot of the headers; this one in particular is very important: Content-Type: application/x-www-form-urlencoded, because it tells your HTTP client how to format/encode the payload.
Check the requests documentation to see how these form posts work.

Python requests fails to scrape a web page with 403 Forbidden

I want to audit train timetables. The trains have GPS and their positions are published at https://trenesendirecto.sofse.gob.ar/mapas/sanmartin/index.php. My plan is to scrape the train positions, check the times at which they arrive at the stations, and publish this info to all users.
To obtain the train coordinates, I wrote the following script in Python:
import requests, random, string

# Function to generate a random code for rnd
def RandomGenerator():
    x = ''.join(random.choice(string.ascii_uppercase + string.ascii_lowercase + string.digits) for _ in range(16))
    return x

# URL requests
url = 'https://trenesendirecto.sofse.gob.ar/mapas/ajax_posiciones.php'
parametros = {
    'ramal': '31',
    'rnd': RandomGenerator(),
    'key': 'v%23v%23QTUNWp%23MpWR0wkj%23RhHTqVUM'}
encabezado = {
    'Host': 'trenes.sofse.gob.ar',
    'Referer': 'https://trenesendirecto.sofse.gob.ar/mapas/sanmartin/index.php',
    'X-Requested-With': 'XMLHttpRequest',
    'Accept': 'application/json, text/javascript, */*',
    'UserAgent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'
}
res = requests.get(url, params=parametros, headers=encabezado, timeout=1)

# Output
print(res.url)
print(res.headers)
print(res.status_code)
print(res.content)
The output is:
https://trenesendirecto.sofse.gob.ar/mapas/ajax_posiciones.php?ramal=31&key=v%2523v%2523QTUNWp%2523MpWR0wkj%2523RhHTqVUM&rnd=ui8GObHTSpVpPqRo
{'Date': 'Tue, 13 Mar 2018 12:16:03 GMT', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Server': 'nginx'}
403
b'<html>\r\n<head><title>403 Forbidden</title></head>\r\n<body bgcolor="white">\r\n<center><h1>403 Forbidden</h1></center>\r\n<hr><center>nginx</center>\r\n</body>\r\n</html>\r\n'
Opening the same URL generated by requests in the browser, I obtain exactly the output I want.
Why does the script not work?
Is there any other method to obtain the data?
Have you tried testing the API URL in a REST client such as Postman or Mozilla's RESTClient add-on? This is the first step in web development before consuming web services in an application.
Besides, error code 403 means you may not be authorized to access this data, or you do not have the right permissions set. The latter is usually the case with 403 errors, which is what distinguishes them from 401 errors.
You must confirm whether the API uses Basic Auth or token-based authentication.
A general GET request in RESTClient for this URL gives status 200 OK, which means the endpoint responds to HTTP requests but needs authorization if you want to request certain information.
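For illustration, both schemes are easy to test with requests once you know which one the API expects (the credentials below are placeholders, not values from the question):
import requests

url = 'https://trenesendirecto.sofse.gob.ar/mapas/ajax_posiciones.php'
# Basic Auth:
res = requests.get(url, auth=('user', 'password'))
# Token-based authentication:
res = requests.get(url, headers={'Authorization': 'Bearer <token>'})
print(res.status_code)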

POST request failing after migrating from requests to urllib2

I have moved away from using the Python requests library, as it was a bit fiddly to get working on Google App Engine. Instead I'm using urllib2, which has better support. Unfortunately, the POST request that previously worked with the requests library no longer works with urllib2.
With requests the code was as follows
values = { 'somekey' : 'somevalue'}
r = requests.post(some_url, data=values)
With urllib2, the code is as follows
import urllib
import urllib2

values = {'somekey': 'somevalue'}
data = urllib.urlencode(values)
req = urllib2.Request(some_url, data)
response = urllib2.urlopen(req)
Unfortunately the latter raises the following error
HTTP Error 405: Method Not Allowed
The url I'm posting to has the following form:
some_url = 'http://ec2-11-111-111-1.compute-1.amazonaws.com'
I have read that there is an issue with trailing slashes in urllib2; however, when I add the slash as follows:
some_url = 'http://ec2-11-111-111-1.compute-1.amazonaws.com/'
I still get the same error.
There is no redirection at the destination, so the request shouldn't be transformed into a GET. Even if it were, the URL actually accepts GET and POST requests. The EC2 Linux instance is running Django, and I can see that it successfully receives the request and makes use of it, so it doesn't return a 405 itself. For some reason urllib2 is picking up a 405 and throwing the exception.
Any ideas as to what may be going wrong?
Edit 1:
As per @philip-tzou's good call, the following information might help:
print req.get_method()
yields POST
print req.header_items()
yields []
Edit 2
Adding a user agent header (as @padraic-cunningham suggests) unfortunately didn't solve it. I added the same header shown in the urllib2 example:
user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
headers = {'User-Agent': user_agent}
data = urllib.urlencode(values)
req = urllib2.Request(some_url, data, headers)
Edit 3
As @furas suggested, I've sent the request I'm making to requestb.in to double-check what's being sent. It is indeed a POST request, with the following headers:
Connection: close
Total-Route-Time: 0
X-Cloud-Trace-Context: ed073df03ccd05657<removed>2a203/<removed>129639701404;o=5
Connect-Time: 1
Cf-Ipcountry: US
Cf-Ray: <removed>1d155ac-ORD
Cf-Visitor: {"scheme":"http"}
Content-Length: 852
Content-Type: application/x-www-form-urlencoded
Via: 1.1 vegur
X-Request-Id: 20092050-5df4-42f8-8fe0-<removed>
Accept-Encoding: gzip
Cf-Connecting-Ip: <removed>
Host: requestb.in
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppEngine-Google; (+http://code.google.com/appengine; appid: s~<removed>)
and now req.header_items() yields [('User-agent', 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)')].
By way of comparison, the headers from the requests POST are:
Cf-Connecting-Ip: <removed>
Accept-Encoding: gzip
Host: requestb.in
Cf-Ipcountry: US
Total-Route-Time: 0
Cf-Ray: 2f42a45185<removed>-ORD
Connect-Time: 1
Connection: close
Cf-Visitor: {"scheme":"http"}
Content-Type: application/x-www-form-urlencoded
Content-Length: 852
X-Cloud-Trace-Context: 8932f0f9b165d9a0f698<removed>/1740038835186<removed>;o=5
Accept: */*
X-Request-Id: c88a29e1-660e-4961-8112-<removed>
Via: 1.1 vegur
User-Agent: python-requests/2.11.1 AppEngine-Google; (+http://code.google.com/appengine; appid: s~<removed>)
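One visible difference between the two dumps is that requests also sends an Accept: */* header, which the urllib2 request lacks. A minimal sketch of mirroring it (an assumption to test, not a confirmed fix):
import urllib
import urllib2

some_url = 'http://ec2-11-111-111-1.compute-1.amazonaws.com/'
values = {'somekey': 'somevalue'}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)',
           'Accept': '*/*'}  # present in the requests dump, missing from the urllib2 one
data = urllib.urlencode(values)
req = urllib2.Request(some_url, data, headers)
response = urllib2.urlopen(req)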

Sending HTTP POST requests works in Fiddler but not in Python

I am sending several POST requests in Fiddler2 to check my site and make sure it is working properly. However, I want to automate this in Python to simulate it over several hours (I really don't want to spend 7 hours hitting space!).
This works in Fiddler: I can create the account and perform related API commands. However, in Python, nothing happens with this code:
def main():
    import socket
    from time import sleep
    x = raw_input("Points: ")
    x = int(x)
    x = int(x/150)
    for y in range(x):
        new = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        new.connect(('example.com', 80))
        mydata = """POST http://www.example.com/api/site/register/ HTTP/1.1
Host: www.example.com
Connection: keep-alive
Content-Length: 191
X-NewRelic-ID: UAIFVlNXGwEFV1hXAwY=
Origin: http://www.example.com
User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.154 Safari/537.36
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
Accept: application/json, text/javascript, */*; q=0.01
X-Requested-With: XMLHttpRequest
X-CSRFToken: CEC9EzYaQOGBdO9HGPVVt3Fg66SVWVXg
DNT: 1
Referer: http://www.example.com/signup
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-GB,en;q=0.8
Cookie: sessionid=sessionid; sb-closed=true; arp_scroll_position=600; csrftoken=2u92jo23g929gj2; __utma=912.1.1.2.5.; __utmb=9139i91; __utmc=2019199; __utmz=260270731.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided)
username=user&password=password&moredata=here """
        new.send(mydata.encode('hex'))
        print "Sent", y, "of", x
        sleep(1)
    print "Sent all!"
    print "restarting"
    main()

main()
I know I could use while True, but I intend to add more functions later to test more sites.
Why does this program not do anything to the API, when Fiddler2 can? I know it is my program, as I can send the exact same packet in Fiddler (obviously pointing at the right place) and it works.
PS: If anyone does fix this, as it's probably something really obvious, please only use modules that are bundled with Python. I cannot install modules from other places. Thanks!
HTTP requests are not as easy as you think they are. First of all this is wrong:
"""POST http://www.example.com/api/site/register/ HTTP/1.1
Host: www.example.com
Connection: keep-alive
...
"""
Each line in an HTTP request has to end with CRLF (in Python, with \r\n), i.e. it should be:
"""POST http://www.example.com/api/site/register/ HTTP/1.1\r
Host: www.example.com\r
Connection: keep-alive\r
...
"""
Note: the LF (line feed, \n) is there implicitly at the end of each line of the triple-quoted string. Also, you didn't see the CR in Fiddler because it's whitespace, but it has to be there (a simple copy-paste won't work).
HTTP also specifies that after the headers there has to be another CRLF, i.e. your entire request should be:
mydata = """POST http://www.example.com/api/site/register/ HTTP/1.1\r
Host: www.example.com\r
Connection: keep-alive\r
Content-Length: 191\r
X-NewRelic-ID: UAIFVlNXGwEFV1hXAwY=\r
Origin: http://www.example.com\r
User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.154 Safari/537.36\r
Content-Type: application/x-www-form-urlencoded; charset=UTF-8\r
Accept: application/json, text/javascript, */*; q=0.01\r
X-Requested-With: XMLHttpRequest\r
X-CSRFToken: CEC9EzYaQOGBdO9HGPVVt3Fg66SVWVXg\r
DNT: 1\r
Referer: http://www.example.com/signup\r
Accept-Encoding: gzip,deflate,sdch\r
Accept-Language: en-GB,en;q=0.8\r
Cookie: sessionid=sessionid; sb-closed=true; arp_scroll_position=600; csrftoken=2u92jo23g929gj2; __utma=912.1.1.2.5.; __utmb=9139i91; __utmc=2019199; __utmz=260270731.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided)\r
\r
username=user&password=password&moredata=here"""
Warning: it should be written exactly like this. There can't be any spaces in front of the lines, i.e. this:
mydata = """POST http://www.example.com/api/site/register/ HTTP/1.1\r
Host: www.example.com\r
Connection: keep-alive\r
...
"""
is wrong.
Side note: you can move mydata to the top, outside of the loop. It's an unimportant optimization, but it makes your code cleaner.
Now, you've said that the site you are using wants you to hex-encode the HTTP request? That's hard to believe (HTTP is a raw string by definition). Don't do that (and ask them to specify what exactly this hex encoding means). Possibly they meant that the URL should be percent-encoded (since that is the only hex-like encoding actually used in HTTP)? In your case there is nothing to encode, so don't worry about it. Just remove the .encode('hex') call.
Also, the Content-Length header is messed up. It should be the actual length of the content. So if, for example, the body is username=user&password=password&moredata=here, then it should be Content-Length: 45.
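Rather than hard-coding it, you can compute it from the body, e.g.:
body = "username=user&password=password&moredata=here"
print "Content-Length: %d" % len(body)  # Content-Length: 45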
The next thing is that the server might not allow you to make multiple requests without reading the response. You should use new.recv(b), where b is the number of bytes you want to read. But how many should you read? Well, this is where the Content-Length header comes in: first you have to read the headers (i.e. read until you reach \r\n\r\n, which marks the end of the headers) and then read the body (based on the Content-Length header). As you can see, things are becoming messy (see the final section of my answer).
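A minimal sketch of that read loop, assuming new is the connected socket from the question and the server sends a Content-Length header (no chunked responses, no error handling):
buf = ''
while '\r\n\r\n' not in buf:            # read until the end of the headers
    buf += new.recv(4096)
head, _, body = buf.partition('\r\n\r\n')
length = 0
for line in head.split('\r\n')[1:]:     # skip the status line
    name, _, value = line.partition(':')
    if name.lower() == 'content-length':
        length = int(value.strip())
while len(body) < length:               # then read the advertised number of body bytes
    body += new.recv(4096)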
There are probably more issues with your code. For example, the X-CSRFToken header suggests that the site uses a CSRF prevention mechanism, in which case your request might not work at all (you have to get the value of X-CSRFToken from the server).
Finally: don't use sockets directly. httplib (http://docs.python.org/2/library/httplib.html) is a great (and built-in) library for making HTTP requests, and it will deal with all the funky and tricky HTTP stuff for you. Your code, for example, may look like this:
import httplib

headers = {
    "Host": "www.example.com",
    "X-NewRelic-ID": "UAIFVlNXGwEFV1hXAwY=",
    "Origin": "http://www.example.com",
    "Connection": "keep-alive",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.154 Safari/537.36",
    "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "X-Requested-With": "XMLHttpRequest",
    "X-CSRFToken": "CEC9EzYaQOGBdO9HGPVVt3Fg66SVWVXg",
    "DNT": "1",
    "Referer": "http://www.example.com/signup",
    "Accept-Encoding": "gzip,deflate,sdch",
    "Accept-Language": "en-GB,en;q=0.8",
    "Cookie": "sessionid=sessionid; sb-closed=true; arp_scroll_position=600; csrftoken=2u92jo23g929gj2; __utma=912.1.1.2.5.; __utmb=9139i91; __utmc=2019199; __utmz=260270731.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided)"
}
body = "username=user&password=password&moredata=here"
conn = httplib.HTTPConnection("example.com")
conn.request("POST", "http://www.example.com/api/site/register/", body, headers)
res = conn.getresponse()
Note that you don't need to specify the Content-Length header yourself.
