I am trying to script an HTTP POST request with Python.
When I try it with curl from bash, everything works. With Python, using either the requests or the urllib3 library, I get an error response from the API. The POST request contains information in the headers and as JSON in the request body.
What I noticed when intercepting the packets with Wireshark: the curl request (which works) is one single packet of 374 bytes. The Python request (no difference between requests and urllib3 here) is split into 2 separate packets of 253 and 144 bytes.
Wireshark reassembles these without problems, and both seem to contain the complete information in the header and POST body. But the API I am trying to connect to answers with a not very helpful "Error when processing request".
As 253 bytes can't be the limit of a TCP packet, what is the reason for this behavior? Is there a way to fix it?
EDIT:
bash:
curl 'http://localhost/test.php' -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36' -H 'Content-Type: application/json' -d '{"key1":"value1","key2":"value2","key3":"value3"}'
python:
import requests, json

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36",
    "Content-Type": "application/json"
}
data = {"key1": "value1", "key2": "value2", "key3": "value3"}
r = requests.post("http://localhost/test.php", headers=headers, data=json.dumps(data))
TCP is a data stream, not a series of messages. How the stream is segmented into packets should be of no relevance to the interpretation of the data, neither for the sender nor for the recipient. If the recipient actually behaves differently based on the segmentation of the packets, then the recipient is broken.
While I've seen such broken systems, I've seen far more systems which simply don't like the request for other reasons, such as a wrong user agent, a missing Accept header, or similar. I would suggest you check this first before concluding that it must be the segmentation of the data stream.
As for why curl and requests behave differently: curl probably constructs the full request (header and body) first and sends it in one go, while requests constructs the header first, sends it, and then sends the body, i.e. it does two write operations, which may result in two packets.
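To illustrate the stream-versus-packets point, here is a toy sketch (not from the original post; the port number is arbitrary): the client writes its data in two separate send calls, yet the server reads back one continuous byte stream, exactly as if it had been sent in a single call.
import socket, threading

# Listening socket is set up before the client thread starts, to avoid a race.
srv = socket.socket()
srv.bind(("127.0.0.1", 9999))
srv.listen(1)

def client():
    cli = socket.create_connection(("127.0.0.1", 9999))
    cli.sendall(b"hello")   # first write -- may end up in its own packet
    cli.sendall(b"world")   # second write -- may end up in another packet
    cli.close()

threading.Thread(target=client).start()

conn, _ = srv.accept()
data = b""
while True:
    chunk = conn.recv(1024)
    if not chunk:
        break
    data += chunk
print(data)   # b'helloworld' -- one stream, regardless of packet boundaries
conn.close()
srv.close()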
Although it should not matter for the issue you are having, there is a way to force the data from multiple sends into one packet, namely the TCP_CORK socket option (platform dependent, though).
Create an adapter first:
import requests
from requests.packages.urllib3.connection import HTTPConnection

class HTTPAdapterWithSocketOptions(requests.adapters.HTTPAdapter):
    def __init__(self, *args, **kwargs):
        self.socket_options = kwargs.pop("socket_options", None)
        super(HTTPAdapterWithSocketOptions, self).__init__(*args, **kwargs)

    def init_poolmanager(self, *args, **kwargs):
        if self.socket_options is not None:
            kwargs["socket_options"] = self.socket_options
        super(HTTPAdapterWithSocketOptions, self).init_poolmanager(*args, **kwargs)
Then use it for the requests you want to send out:
import socket

s = requests.Session()
options = HTTPConnection.default_socket_options + [(socket.IPPROTO_TCP, socket.TCP_CORK, 1)]
adapter = HTTPAdapterWithSocketOptions(socket_options=options)
s.mount("http://", adapter)
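With the adapter mounted, the session is used like any other requests session. A minimal sketch, reusing the JSON payload from the question above (note that TCP_CORK only exists on Linux):
import json

payload = {"key1": "value1", "key2": "value2", "key3": "value3"}
r = s.post("http://localhost/test.php",
           headers={"Content-Type": "application/json"},
           data=json.dumps(payload))
print(r.status_code)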
Sadly there are indeed very broken systems, as @Steffen Ullrich explains (even though they claim to be industry standards), which aren't capable of handling fragmented TCP frames. Since my application/script is rather isolated and self-contained, I used the simpler workaround based on @Roeften's answer, which applies TCP_CORK to all connections.
Warning: this workaround makes sense only in situations when you don't risk breaking any other functionality relying on requests.
# (6, 3, 1) is (socket.IPPROTO_TCP, socket.TCP_CORK, 1) written as plain numbers
requests.packages.urllib3.connection.HTTPConnection.default_socket_options = [(6, 3, 1)]
Related
I am creating an HTTP Proxy Server that is able to retrieve the URL of the website requested by a user. I am only allowed to use a single file for my HTTP Proxy Server (I can't have multiple files).
Within an infinitely running while loop, I am able to detect a connection and the address and receive a message from the client:
while True:
    conn, addr = created_socket.accept()
    data_received = conn.recv(1024)
    print(data_received)
When I run my server on a specified port and type the [IP Address]:[Port Number] into Chrome, I get the following result after printing data_received:
b'GET /www.google.com HTTP/1.1\r\nHost: 192.168.1.2:5050\r\nConnection: keep-alive\r\nUpgrade-Insecure-Requests: 1\r\nUser-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36\r\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9\r\nAccept-Encoding: gzip, deflate\r\nAccept-Language: en-US,en;q=0.9\r\n\r\n'
Is there a systematic way in which I can retrieve the URL (in this case, www.google.com)? Right now, I am hard-coding a constant buffer size for conn.recv (1024). However, I was wondering if there is a way to first retrieve the size of the client's message, store it in a variable, and then pass that variable as the buffer size for recv?
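For the first part, a minimal sketch (not from any answer in this thread) that pulls the requested path out of the bytes shown above; it assumes the whole request line arrived in the first recv call:
request_text = data_received.decode("utf-8", errors="replace")
request_line = request_text.split("\r\n", 1)[0]   # e.g. "GET /www.google.com HTTP/1.1"
method, path, version = request_line.split(" ", 2)
url = path.lstrip("/")                            # "www.google.com"
print(method, url)
As for the buffer size: a GET request has no body, so there is nothing more to read after the blank line; for requests that do have a body, the usual approach is to parse the Content-Length header and keep calling recv until that many body bytes have arrived.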
I am trying to send an HTTP GET request to a certain website, for example https://www.united.com, but it gets stuck with no response.
Here is the code:
from urllib.request import urlopen

url = 'https://www.united.com'
resp = urlopen(url, timeout=10)
Every time, it times out. But the same code works for other URLs, for example https://www.aa.com.
So I wonder what is behind https://www.united.com that keeps me from getting the HTTP request through. Thank you!
Update:
Adding a request header still doesn't work for this site:
from urllib.request import urlopen, Request

url = 'https://www.united.com'
req = Request(
    url,
    data=None,
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36'
    }
)
resp = urlopen(req, timeout=3)
The server at united.com might only respond to certain user-agent strings or request headers and block everything else. You have to send headers or a user-agent string that their server allows. This varies from website to website: sites that want to add extra security to their applications can be very specific about which user-agents may access a given resource.
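If you want to experiment with that, here is a hedged sketch that sends a few typical browser headers on top of the User-Agent; the header names are standard, but which combination (if any) united.com actually accepts is an assumption:
from urllib.request import urlopen, Request

url = 'https://www.united.com'
req = Request(
    url,
    headers={
        # Values copied from a desktop browser; the exact set the site requires is unknown.
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.9',
    },
)
resp = urlopen(req, timeout=10)
print(resp.status)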
I am trying to send a GET request to an API using Node.js. I don't have control over the server side. The API requires two things to be authenticated (I am getting those two values by logging in manually and then copying them over from Chrome to my script):
A cookie
The user-agent that was used to perform the login
While this whole thing worked a couple of weeks or months ago, I now keep getting a status 401 (unauthorized). I asked a friend for help, who isn't a pro in Node but is pretty good with Python. He tried to build the same request with Python and, to both our surprise, it works perfectly fine.
So here I am with two scripts that are supposed to do an absolutely identical thing, but with different outcomes. The request headers are identical in both; since the Python request works fine, this also confirms that they are valid and sufficient to authenticate the request. Both scripts run on the same machine under Windows 10.
Script in Node.JS (returns a 401 - unauthorized):
const request = require("request");
const url = "https://api.rollbit.com/steam/market?query&order=1&showTradelocked=false&showCustomPriced=true&min=0&max=4294967295"
const headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    'Cookie': '__Secure-RollbitSession=JWDEFp________HaLLfT'
};

request.get(url, {headers: headers, json: true}, function(err, resp, body){
    console.log(" > Response status in JS: " + resp.statusCode);
});
Same script in Python (returns a 200 - success):
import requests
url = "https://api.rollbit.com/steam/market?query&order=1&showTradelocked=false&showCustomPriced=true&min=0&max=4294967295"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    'Cookie': '__Secure-RollbitSession=JWDEFp________HaLLfT'
}
r = requests.request("GET", url, headers=headers)
print(" > Response status: in PY:", r.status_code)
Things I've tried:
I intercepted both requests from the scripts above with HTTP Toolkit to see if Python adds something to the headers.
Node.JS request - returned 401
Python request - returned 200
As seen in the intercepted results, Python adds some Accept-Encoding and Accept headers. I tried to copy the FULL exact same headers Python sends into my Node.js script, but I still get the same result (401), even though the (once again) intercepted requests now look identical.
I'm on the newest Python and tried Node 10.x, 12.18.0 and also the latest release.
At this point I don't know what else to try. I don't really need it, but it's completely bugging me that it isn't working for mysterious reasons, and I would really like to find out what is happening.
Thank you!
I'm trying to search a page using BeautifulSoup with Anaconda for Python 3.6.
I am trying to scrape accuweather.com to find the weather in Tel Aviv.
This is my code:
from bs4 import BeautifulSoup
import requests

data = requests.get("https://www.accuweather.com/he/il/tel-aviv/215854/weather-forecast/215854")
soup = BeautifulSoup(data.text, "html.parser")
soup.find('div', {'class': 'info'})
I get this error:
raise ConnectionError(err, request=request)
ConnectionError: ('Connection aborted.', OSError("(10060, 'WSAETIMEDOUT')",))
What can I do and what does this error mean?
What does this error mean
Googling for "errno 10600" yields quite a few results. Basically, it's a low-level network error (it's not http specific, you can have the same issue for any kind of network connection), whose canonical description is
A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
In other words, your system failed to connect to the host. This can happen for a lot of reasons, either temporary (your internet connection is down) or not (a proxy - if you are behind one - blocking access to this host, etc.), or quite simply (as is the case here) the host blocking your requests.
The first thing to do when you get such an error is to check your internet connection, then try to open the URL in your browser. If you can get it in your browser, then it's most often the host blocking you, usually based on your client's "User-Agent" header (the client here is requests). Specifying a "standard" user-agent header as explained in newbie's answer should solve the problem (and it does in this case, or at least it did for me).
NB: to set the user agent:
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
}
data = requests.get("https://www.accuweather.com/he/il/tel-aviv/215854/weather-forecast/215854", headers=headers)
The problem does not come from the code, but from the website.
If you add a User-Agent field to the request headers, the request will look like it comes from a browser.
Example:
from bs4 import BeautifulSoup
import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
}
data = requests.get("https://www.accuweather.com/he/il/tel-aviv/215854/weather-forecast/215854", headers=headers)
Is there a way to get the user-agent and global IP in a particular JSON format? Help me out on this.
Here is what I am trying. I have partial success in getting the global IP, but no information about the user-agent.
import requests, json
r = requests.get('http://httpbin.org/ip').json()
print r['origin']
The above code returns my global IP, but I also want some information about the platform I am using to connect to the particular URL, e.g. 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36'.
Considering you have a JSON object as a string (possibly read from a file?), you would first want to convert that into a Python dictionary object:
import json
request_details = json.loads('{"user-agent": "Chrome", "remote_address": "64.10.1.1"}')
print request_details["user-agent"]
print request_details["remote_address"]
OR
If you are talking about a request that comes to the server, the user-agent is part of the request headers and the remote address is added later in the network layer. Different web frameworks have different ways of letting you access these values. For example, Django lets you access them from the HttpRequest.META dictionary, while Flask gives you request.headers.get("user-agent") and request.remote_addr.
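For example, a minimal Flask sketch (the /whoami route is made up here, purely for illustration) that returns both values as JSON:
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/whoami")  # hypothetical endpoint, just for illustration
def whoami():
    # Both values come straight from the incoming request.
    return jsonify({
        "user-agent": request.headers.get("user-agent"),
        "remote_address": request.remote_addr,
    })

if __name__ == "__main__":
    app.run()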
You can
import requests, json
r = requests.get('https://httpbin.org/user-agent').json()
print r['user-agent']
but I would do that only when I want to verify the user-agent I'm setting in my request header.
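A small sketch of that kind of check, setting a made-up User-Agent and reading back what httpbin saw (it assumes httpbin.org is reachable):
import requests

headers = {'User-Agent': 'my-test-client/1.0'}  # made-up value, just for the check
r = requests.get('https://httpbin.org/user-agent', headers=headers)
print(r.json()['user-agent'])  # should echo back 'my-test-client/1.0'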