Python: How to get URL when running an HTTP proxy server?

I am creating an HTTP Proxy Server that is able to retrieve the URL of the website requested by a user. I am only allowed to use a single file for my HTTP Proxy Server (I can't have multiple files).
Within an infinitely running while loop, I am able to accept a connection, get the client address, and receive a message from the client:
while True:
    conn, addr = created_socket.accept()
    data_received = conn.recv(1024)
    print(data_received)
When I run my server on a specified port and type the [IP Address]:[Port Number] into Chrome, I get the following result after printing data_received:
b'GET /www.google.com HTTP/1.1\r\nHost: 192.168.1.2:5050\r\nConnection: keep-alive\r\nUpgrade-Insecure-Requests: 1\r\nUser-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36\r\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9\r\nAccept-Encoding: gzip, deflate\r\nAccept-Language: en-US,en;q=0.9\r\n\r\n
Is there a systematic way to retrieve the URL (in this case, www.google.com)? Right now I am hard-coding a constant buffer size for conn.recv(1024). Is there a way to first retrieve the size of the client's message, store it in a variable, and then pass that variable as the buffer size for recv?
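One way to approach both parts (a minimal sketch against the snippet above, not a complete proxy): HTTP does not announce the total request size up front, so instead of asking for the size before calling recv, you normally loop on recv until the blank line (b"\r\n\r\n") that terminates the headers appears, and then parse the request line. created_socket is the listening socket from the question; everything else is illustrative.

while True:
    conn, addr = created_socket.accept()

    # Read until the end of the request headers (b"\r\n\r\n").
    data_received = b""
    while b"\r\n\r\n" not in data_received:
        chunk = conn.recv(1024)
        if not chunk:  # client closed the connection
            break
        data_received += chunk

    # The request line is the first line, e.g. b"GET /www.google.com HTTP/1.1".
    request_line = data_received.split(b"\r\n", 1)[0]
    method, path, version = request_line.split(b" ", 2)

    # In the capture above, the requested URL is the path with the leading "/" stripped.
    url = path.decode().lstrip("/")
    print(url)  # -> www.google.com

For a request with a body (e.g. POST), you would additionally read the Content-Length header and then recv that many more bytes after the header terminator.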

Related

Python: imitate WebSocket connection as if coming from a browser

I am trying to automate an interaction with a game website by communicating with a WebSocket via Python.
In particular, I am trying to communicate with the WebSocket at: "wss://socket.colonist.io".
If I simply execute the following JS-code from the browser, I receive the incoming messages as expected:
ws = new WebSocket('wss://socket.colonist.io');
ws.onmessage = e => {
    console.log(e);
};
However, as soon as I try to connect to this WebSocket from outside the browser (with Node.js or with Python), the connection is immediately closed by the remote host. An example using websocket-client in Python can be found below:
import websocket

def on_message(ws, data):
    print(f'received {data}')

websocket.enableTrace(True)
socket = websocket.WebSocketApp('wss://socket.colonist.io',
                                header={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'},
                                on_message=on_message)
socket.run_forever(origin='https://colonist.io')
socket.close()
The trace output is the following:
--- request header ---
GET / HTTP/1.1
Upgrade: websocket
Host: socket.colonist.io
Origin: https://colonist.io
Sec-WebSocket-Key: EE3U0EDp36JGZBHWUN5q4Q==
Sec-WebSocket-Version: 13
Connection: Upgrade
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36
-----------------------
--- response header ---
HTTP/1.1 101 Switching Protocols
Server: nginx/1.18.0 (Ubuntu)
Date: Sat, 24 Sep 2022 17:33:32 GMT
Connection: upgrade
Upgrade: websocket
Sec-WebSocket-Accept: EwMJ+z82BuOBOSWONpuOhjNdVCQ=
-----------------------
websocket connected
Connection to remote host was lost. - goodbye
I also tried it using Python-Autobahn and Python-websockets, both with the same negative result.
I suspect the host somehow detects that the connection is not coming from a browser (although I set a 'User-Agent' and the 'Origin') and therefore closes it immediately. Is there any possibility to connect to this WebSocket from a script NOT running in a browser?
I am aware of the possibility of using Selenium to run and control a browser instance with Python, but I want to avoid this at all costs for performance reasons. (I want to control as many WebSocket connections concurrently as possible for my project.)
I found the problem. Because the connection worked from the Chrome console in a fresh incognito window without ever visiting the host colonist.io, and the "Application" tab of the Chrome developer tools did not show any stored cookies, I assumed no cookies were involved. After decrypting and analyzing the TLS traffic with Wireshark, I found that a JWT is sent as a cookie on the initial GET request. After adding this cookie to my Python implementation, it worked without any problems.
The remaining question now is: Where does this JWT come from if I don't see it in the "Application" tab and the only request being made is the WebSocket connection?
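For reference, a minimal sketch of the fix described above using websocket-client. The cookie name and JWT value are placeholders for whatever the Wireshark capture shows; here they are attached via an explicit Cookie header.

import websocket

# Placeholders: use the cookie name and JWT value observed in the Wireshark capture.
COOKIE_NAME = 'session'
JWT = 'eyJ...'

def on_message(ws, data):
    print(f'received {data}')

socket = websocket.WebSocketApp(
    'wss://socket.colonist.io',
    header={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
        'Cookie': f'{COOKIE_NAME}={JWT}',
    },
    on_message=on_message)
socket.run_forever(origin='https://colonist.io')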

Google Scholar scraping with ip-rotator through AWS ApiGateway

I am receiving the error below. The code (George's method, https://stackoverflow.com/users/7173479/george) worked a couple of times in the beginning, and a bit later it crashed. It is probably something to do with the HTTP configuration, but I am lost in the AWS documentation. I am working in a Jupyter notebook. Could anybody help?
# Create gateway object and initialise in AWS
engine = 'https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q={}&btnG='
gateway = ApiGateway(engine,
                     access_key_id="KEY", access_key_secret="SECRET_KEY")
gateway.start()

# Assign gateway to session
session = requests.Session()
session.mount(engine, gateway)

# Send request (IP will be randomised)
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                        'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'}
search_string = '{}+and+{}+and+{}+and+{}'.format('term1', 'term2', 'term3', 'term4')
url = engine.format(search_string)
print(url)
response = session.get(url, headers=header)
tree = BeautifulSoup(response.content, 'lxml')
result = tree.find('div', id='gs_ab_md')
print(response.status_code)
print(result.text)
print(len(result.text))
number = [int(s.replace('.', '').replace(',', '')) for s in result.text.split()
          if s.replace('.', '').replace(',', '').isdigit()]

# Delete gateways
gateway.shutdown()
=====================================
BadRequestException: An error occurred (BadRequestException) when calling the PutIntegration operation: Invalid HTTP endpoint specified for URI
The site parameter of the ApiGateway constructor in the requests-ip-rotator package expects just the site: it can't contain any part of the URI other than the protocol, the domain name or IP address, and the port.
If you change your constructor to something like this:
gateway = ApiGateway("https://scholar.google.com")
gateway.start()
It will construct the gateway endpoint correctly.
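Putting the fix together with the original snippet (a sketch, assuming the rest of the setup is unchanged): the gateway is created and the session is mounted with the bare site, while the full search URL is still built from the engine template.

import requests
from requests_ip_rotator import ApiGateway

site = 'https://scholar.google.com'
engine = 'https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q={}&btnG='

gateway = ApiGateway(site,
                     access_key_id="KEY", access_key_secret="SECRET_KEY")
gateway.start()

session = requests.Session()
session.mount(site, gateway)  # mount on the bare site, not on the full query URL

url = engine.format('term1+and+term2+and+term3+and+term4')
response = session.get(url)
print(response.status_code)

gateway.shutdown()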

Headers while running EC2 Ubuntu instance

I am attempting to run my code on an AWS EC2 (Ubuntu) instance. The code works perfectly fine on my local machine, but it doesn't seem to be able to connect to the website from inside the server.
I'm assuming it has something to do with the headers. I have installed Firefox and Chrome on the server, but that doesn't seem to make a difference.
Any ideas on how to fix this problem would be appreciated.
import requests

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'}

# Making a get request
response = requests.get("https://us.louisvuitton.com/eng-us/products/pocket-organizer-monogram-other-nvprod2380073v", headers=HEADERS)  # hangs here, can't make request in server

# print response
print(response.status_code)
Output:
It doesn't give me one; it just stays blank until I interrupt it with KeyboardInterrupt.
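One diagnostic step (a sketch, not a confirmed fix): add a timeout so the request fails with an exception instead of hanging indefinitely. A hang like this often means the site is silently dropping traffic from datacenter IP ranges such as EC2, in which case no header change will help.

import requests

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'}

try:
    # timeout=(connect, read) in seconds: raise instead of hanging forever
    response = requests.get(
        "https://us.louisvuitton.com/eng-us/products/pocket-organizer-monogram-other-nvprod2380073v",
        headers=HEADERS,
        timeout=(5, 15),
    )
    print(response.status_code)
except requests.exceptions.RequestException as exc:
    print(f"request failed: {exc}")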

Connection error in python-requests

I'm trying to scrape using BeautifulSoup with Anaconda for Python 3.6.
I am trying to scrape accuweather.com to find the weather in Tel Aviv.
This is my code:
from bs4 import BeautifulSoup
import requests

data = requests.get("https://www.accuweather.com/he/il/tel-aviv/215854/weather-forecast/215854")
soup = BeautifulSoup(data.text, "html.parser")
soup.find('div', {'class': 'info'})
I get this error:
raise ConnectionError(err, request=request)
ConnectionError: ('Connection aborted.', OSError("(10060,
'WSAETIMEDOUT')",))
What can I do and what does this error mean?
What does this error mean
Googling for "errno 10060" yields quite a few results. Basically, it's a low-level network error (it's not HTTP specific; you can have the same issue with any kind of network connection), whose canonical description is
A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
In other words, your system failed to connect to the host. This can happen for many reasons, some temporary (your internet connection is down), some not (a proxy, if you are behind one, blocking access to this host, etc.), or quite simply, as is the case here, the host blocking your requests.
The first thing to do when you get such an error is to check your internet connection, then try to fetch the URL in your browser. If you can get it in your browser, then it's most often the host blocking you, usually based on your client's "User-Agent" header (the client here is requests), and specifying a "standard" user-agent header as explained in newbie's answer should solve the problem (and it does in this case, or at least it did for me).
NB: to set the user agent:
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
}
data = requests.get("https://www.accuweather.com/he/il/tel-aviv/215854/weather-forecast/215854", headers=headers)
The problem does not come from the code, but from the website.
If you add a User-Agent field to the request headers, it will look like the request comes from a browser.
Example:
from bs4 import BeautifulSoup
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
}
data = requests.get("https://www.accuweather.com/he/il/tel-aviv/215854/weather-forecast/215854", headers=headers)

Python Requests splits TCP packet

I am trying to script an HTTP POST request with Python.
When trying it with curl from bash, everything works. With Python, using either the requests or the urllib3 library, I get an error response from the API. The POST request contains information in the headers and as JSON in the request body.
What I noticed when intercepting the packets with Wireshark: the curl request (which works) is a single packet of 374 bytes, while the Python request (no difference between requests and urllib3 here) is split into two separate packets of 253 and 144 bytes.
Wireshark reassembles these without problems, and both seem to contain the complete information in the header and POST body. But the API I am trying to connect to answers with a not very helpful "Error when processing request".
As 253 bytes can't be the limit of a TCP packet, what is the reason for this behavior? Is there a way to fix it?
EDIT:
bash:
curl 'http://localhost/test.php' -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36' -H 'Content-Type: application/json' -d '{"key1":"value1","key2":"value2","key3":"value3"}'
python:
import requests, json

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36",
    "Content-Type": "application/json"}
data = {"key1": "value1", "key2": "value2", "key3": "value3"}
r = requests.post("http://localhost/test.php", headers=headers, data=json.dumps(data))
TCP is a data stream, not a series of messages. The segmentation of the data stream into packets should be of no relevance to the interpretation of the data stream, by either the sender or the recipient. If the recipient actually behaves differently based on how the stream is segmented into packets, then the recipient is broken.
While I've seen such broken systems, I've seen more systems which do not like the request for other reasons, like a wrong user agent, a missing Accept header or similar. I would suggest you check this first before concluding that it must be the segmentation of the data stream.
As for why curl and requests behave differently: curl probably constructs the full request (header and body) first and sends it in one go, while requests constructs and sends the header first and then sends the body, i.e. it does two write operations, which might result in two packets.
Although it should not matter for the issue you are having, there is a way to force the data from multiple sends into one packet, namely using the TCP_CORK option on the socket (platform dependent though).
Create an adapter first:
import socket
import requests
from requests.packages.urllib3.connection import HTTPConnection

class HTTPAdapterWithSocketOptions(requests.adapters.HTTPAdapter):
    def __init__(self, *args, **kwargs):
        self.socket_options = kwargs.pop("socket_options", None)
        super(HTTPAdapterWithSocketOptions, self).__init__(*args, **kwargs)

    def init_poolmanager(self, *args, **kwargs):
        if self.socket_options is not None:
            kwargs["socket_options"] = self.socket_options
        super(HTTPAdapterWithSocketOptions, self).init_poolmanager(*args, **kwargs)
Then use it for the requests you want to send out:
s = requests.Session()
options = HTTPConnection.default_socket_options + [ (socket.IPPROTO_TCP, socket.TCP_CORK, 1)]
adapter = HTTPAdapterWithSocketOptions(socket_options=options)
s.mount("http://", adapter)
Sadly, as @Steffen Ullrich explains, there are indeed very broken systems (even though they claim to be industry standards) which aren't capable of handling fragmented TCP frames. Since my application/script is rather isolated and self-contained, I used the simpler workaround based on @Roeften's answer, which applies TCP_CORK to all connections.
Warning: this workaround makes sense only in situations when you don't risk breaking any other functionality relying on requests.
# (6, 3, 1) == (socket.IPPROTO_TCP, socket.TCP_CORK, 1) on Linux
requests.packages.urllib3.connection.HTTPConnection.default_socket_options = [(6, 3, 1)]
