I am trying to automate an interaction with a game website by communicating with a WebSocket via Python.
In particular, I am trying to communicate with the WebSocket at: "wss://socket.colonist.io".
If I simply execute the following JavaScript code from the browser console, I receive the incoming messages as expected:
ws = new WebSocket('wss://socket.colonist.io');
ws.onmessage = e => {
    console.log(e);
};
However, as soon as I try to connect to this WebSocket from outside the browser (with Node.js or with Python), the connection is immediately closed by the remote host. An example using websocket-client in Python can be found below:
import websocket

def on_message(ws, data):
    print(f'received {data}')

websocket.enableTrace(True)
socket = websocket.WebSocketApp('wss://socket.colonist.io',
                                header={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'},
                                on_message=on_message)
socket.run_forever(origin='https://colonist.io')
socket.close()
The trace output is the following:
--- request header ---
GET / HTTP/1.1
Upgrade: websocket
Host: socket.colonist.io
Origin: https://colonist.io
Sec-WebSocket-Key: EE3U0EDp36JGZBHWUN5q4Q==
Sec-WebSocket-Version: 13
Connection: Upgrade
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36
-----------------------
--- response header ---
HTTP/1.1 101 Switching Protocols
Server: nginx/1.18.0 (Ubuntu)
Date: Sat, 24 Sep 2022 17:33:32 GMT
Connection: upgrade
Upgrade: websocket
Sec-WebSocket-Accept: EwMJ+z82BuOBOSWONpuOhjNdVCQ=
-----------------------
websocket connected
Connection to remote host was lost. - goodbye
I also tried it using Python's autobahn and websockets packages, both with the same negative result.
I suspect the host somehow detects that the connection is not coming from a browser (although I set a 'User-Agent' and the 'Origin') and therefore closes it immediately. Is there any possibility I can connect to this WebSocket from a script NOT running in a browser?
I am aware of the possibility of using Selenium to run and control a browser instance with Python, but I want to avoid this at all costs for performance reasons. (I want to control as many WebSocket connections concurrently as possible for my project.)
I found the problem. Because the connection worked from a new incognito window via the Chrome console without ever visiting the host colonist.io, and the "Application" tab of the Chrome developer panel did not show any stored cookies, I assumed no cookies were involved. After decrypting and analyzing the TLS communication with Wireshark, I found out that a JWT is sent as a cookie on the initial GET request. After adding this cookie to my Python implementation, it worked without any problems.
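For reference, a minimal sketch of what the working call looks like with websocket-client; the cookie name and JWT value below are placeholders, since the real ones have to be captured from your own traffic as described above:

import websocket

def on_message(ws, data):
    print(f'received {data}')

# placeholder: capture the real cookie name and JWT value from your own
# decrypted TLS traffic (e.g. with Wireshark)
jwt_cookie = '<cookie-name>=<jwt-from-traffic>'

socket = websocket.WebSocketApp(
    'wss://socket.colonist.io',
    header={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'},
    cookie=jwt_cookie,  # websocket-client sends this as the Cookie header
    on_message=on_message)
socket.run_forever(origin='https://colonist.io')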
The remaining question now is: Where does this JWT come from if I don't see it in the "Application" tab and the only request being made is the WebSocket connection?
Related
I am creating an HTTP Proxy Server that is able to retrieve the URL of the website requested by a user. I am only allowed to use a single file for my HTTP Proxy Server (I can't have multiple files).
Within an infinite while loop, I am able to detect a connection, get the client address, and receive a message from the client:
while True:
    conn, addr = created_socket.accept()
    data_received = conn.recv(1024)
    print(data_received)
When I run my server on a specified port and type the [IP Address]:[Port Number] into Chrome, I get the following result after printing data_received:
b'GET /www.google.com HTTP/1.1\r\nHost: 192.168.1.2:5050\r\nConnection: keep-alive\r\nUpgrade-Insecure-Requests: 1\r\nUser-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36\r\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9\r\nAccept-Encoding: gzip, deflate\r\nAccept-Language: en-US,en;q=0.9\r\n\r\n'
Is there a systematic way to retrieve the URL (in this case, www.google.com)? Right now, I am hard-coding a constant buffer size for conn.recv (1024). However, I was wondering whether there is a way to first retrieve the size of the client's message, store it in a variable, and then pass that variable as the buffer size for recv?
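For what it's worth, HTTP does not announce the request size up front, so there is no size to read into a variable first. A common approach (a minimal sketch, assuming a GET request so only the headers matter) is to keep calling recv until the blank line that terminates the headers arrives, then parse the request line:

def read_headers(conn):
    # accumulate recv() chunks until the empty line that ends the headers
    data = b''
    while b'\r\n\r\n' not in data:
        chunk = conn.recv(1024)
        if not chunk:            # client closed the connection early
            break
        data += chunk
    return data

request = read_headers(conn)
request_line = request.split(b'\r\n', 1)[0]   # b'GET /www.google.com HTTP/1.1'
method, target, version = request_line.decode().split()
url = target.lstrip('/')                      # 'www.google.com'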
I'm not sure how much of the code I can show, but the concept is simple. I am writing a Python script that works with the TD Ameritrade API. I am getting a URL for the portal from the API and opening it in the browser. Next, I'm setting up a socket server to handle the redirect from the portal. Below is the code for the server:
serversocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# get local machine name
host = "localhost"
port = 10120
serversocket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)

# bind to the port
serversocket.bind((host, port))
print("listening")

# queue up to 5 requests
serversocket.listen(5)

allData = ""
while True:
    # establish a connection
    try:
        conn, addr = serversocket.accept()
    except KeyboardInterrupt:
        break
    print("Got a connection from %s" % str(addr))
    while True:
        data = conn.recv(4096)
        if not data:
            print("done")
            break
        print(data.decode('utf-8', "ignore"))
    conn.close()
When I go through the portal and get redirected, in the console I see the following:
Got a connection from ('127.0.0.1', 65505)
|,?2!c[N': [?`XAn] "::+/,0̨̩ / 5
jj localhost
3 + )http/1.1
ej\E<zpִ_<%q\r)+ - +
jj zz
However, if I were to copy the URL, open a new tab, paste it and go, I get the following (correct) response:
Got a connection from ('127.0.0.1', 49174)
GET /?code=<RESPONSE_TOKEN> HTTP/1.1
Host: localhost:10120
Connection: keep-alive
Cache-Control: max-age=0
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36 OPR/68.0.3618.125
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Sec-Fetch-Site: none
Sec-Fetch-Mode: navigate
Sec-Fetch-User: ?1
Sec-Fetch-Dest: document
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
When I go to the network overview of my browser, I see the following warning when trying to view the request headers: "Provisional headers are shown".
The only difference between the HTTP request from the redirect and the request when I manually paste the URL in is the initiator column in the network viewer: it shows "oath" for the redirect and "other" when manually pasted in.
I hope I've provided enough information and code. I can try to put together a reproducible copy if needed, but a TD Ameritrade developer account would be needed to connect with the API.
Thanks in advance for any help. I've been researching for over 6 hours and wasn't able to find anything. Hopefully I didn't miss something obvious.
I think a raw socket is not required to handle an OAuth redirect; sockets are for other kinds of requirements. Also, when you manually hit the redirect, a socket is not what is being invoked, just a simple HTTP endpoint.
Try this snippet, which extracts the OAuth code:
from http.server import BaseHTTPRequestHandler
from urllib.parse import urlparse, parse_qsl

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        url = urlparse(self.path)
        # parse_qsl returns a list of (key, value) pairs; wrap it in dict()
        code = dict(parse_qsl(url.query))['code']
Or this:
https://gist.github.com/willnix/daed2b57ab8d613f6bfa53c6d0b46fd3
You can get more snippets of simple http get endpoints here:
https://gist.github.com/search?q=def+do_GET+python&ref=searchresults
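For completeness, a minimal self-contained sketch along those lines; it assumes the registered redirect URI is http://localhost:10120/ (as in the question) and handles exactly one request:

from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qsl

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # pull ?code=... out of the redirect URL
        params = dict(parse_qsl(urlparse(self.path).query))
        self.server.oauth_code = params.get('code')
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain')
        self.end_headers()
        self.wfile.write(b'You may close this window.')

server = HTTPServer(('localhost', 10120), Handler)
server.handle_request()   # block until the single redirect arrives
print('OAuth code:', server.oauth_code)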
I'm trying to download several large files from a server through a URL.
When downloading the file through Chrome or Internet Explorer it takes around 4-5 minutes to download the file, which has a size of around 100 MB.
But when I try to do the same download using either PyCurl
from io import BytesIO
import pycurl as curl

buffer = BytesIO()
ch = curl.Curl()
ch.setopt(ch.URL, url)
ch.setopt(curl.TRANSFERTEXT, True)
ch.setopt(curl.AUTOREFERER, True)
ch.setopt(curl.FOLLOWLOCATION, True)
ch.setopt(curl.POST, False)
ch.setopt(curl.SSL_VERIFYPEER, 0)
ch.setopt(curl.WRITEFUNCTION, buffer.write)
ch.perform()
Or using requests
r = requests.get(url).text
I get either
pycurl.error: (56, 'OpenSSL SSL_read: Connection was reset, errno 10054')
or
[Errno 10054] An existing connection was forcibly closed by the remote host.
When I look in Chrome during the download of the large file, this is what I see:
General:
Referrer Policy: no-referrer-when-downgrade
Request Headers:
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Cache-Control: no-cache
Connection: keep-alive
Cookie: JSESSIONID=****
Pragma: no-cache
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: none
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/... (KHTML, like Gecko) Chrome/... Safari/...
Is there anything I can do in my configuration to keep the connection from closing, similar to when I access it through my browser? Or is the problem on the server side?
EDIT
To add more information: most of the time after the request is made is spent waiting for the server to put together the data before the actual download starts (it is generating an XML file by aggregating data from different data sources).
Try adding headers and cookies to your request.
headers = { "User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36" }
cookies = { "Cookie1": "Value1"}
r = requests.get(url, headers=headers, cookies=cookies)
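Given the EDIT above (the server spends minutes assembling the file before the download starts), it may also help to stream the response and allow a generous read timeout; a sketch, where the output filename is a placeholder:

import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"}
cookies = {"Cookie1": "Value1"}

# 10 s to connect, up to 600 s for the server to start sending data;
# stream=True avoids buffering the whole ~100 MB response in memory
with requests.get(url, headers=headers, cookies=cookies,
                  stream=True, timeout=(10, 600)) as r:
    r.raise_for_status()
    with open('export.xml', 'wb') as f:      # placeholder filename
        for chunk in r.iter_content(chunk_size=1 << 20):
            f.write(chunk)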
I'm trying to scrape with BeautifulSoup using Anaconda for Python 3.6.
I am trying to scrape accuweather.com to find the weather in Tel Aviv.
This is my code:
from bs4 import BeautifulSoup
import requests

data = requests.get("https://www.accuweather.com/he/il/tel-aviv/215854/weather-forecast/215854")
soup = BeautifulSoup(data.text, "html.parser")
soup.find('div', {'class': 'info'})
I get this error:
raise ConnectionError(err, request=request)
ConnectionError: ('Connection aborted.', OSError("(10060, 'WSAETIMEDOUT')",))
What can I do and what does this error mean?
What does this error mean
Googling for "errno 10060" yields quite a few results. Basically, it's a low-level network error (it's not HTTP-specific; you can have the same issue for any kind of network connection), whose canonical description is
A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
In other words, your system failed to connect to the host. This might happen for many reasons, either temporary (your internet connection is down) or not (a proxy, if you are behind one, blocking access to this host, etc.), or, quite simply (as is the case here), the host blocking your requests.
The first thing to do when you get such an error is to check your internet connection, then try to get the URL in your browser. If you can get it in your browser, then it's most often the host blocking you, usually based on your client's "User-Agent" header (the client here is requests); specifying a "standard" user-agent header as explained in newbie's answer should solve the problem (and it does in this case, or at least it did for me).
NB: to set the user agent:
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
}
data = requests.get("https://www.accuweather.com/he/il/tel-aviv/215854/weather-forecast/215854", headers=headers)
The problem does not come from the code, but from the website.
If you add a User-Agent field to the header of the request, it will look like it comes from a browser.
Example:
from bs4 import BeautifulSoup
import requests
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
}
data = requests.get("https://www.accuweather.com/he/il/tel-aviv/215854/weather-forecast/215854", headers=headers)
I've written a simple HTTP web server in Python, but I've noticed that when I connect to it, the HTML page appears in the browser window, yet the indicator in the Chrome tab continues to spin and the server receives empty strings. This continues until I click the 'X' to stop loading the page. Could someone please explain why this is happening and how to fix it? Also, if the HTTP headers are wrong or I'm missing important ones, please tell me. I found it very difficult to find information on HTTP headers and commands.
You can find the code here.
Link to image of network tab
Console output:
Socket created
Bound socket
Socket now listening
Connected with 127.0.0.1:55146
Connected with 127.0.0.1:55147
Received data: GET / HTTP/1.1
Host: localhost
Connection: keep-alive
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.154 Safari/537.36
DNT: 1
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-GB,en-US;q=0.8,en;q=0.6
Parsing GET command
Client requested directory /index.html with HTTP version 1.1
html
/index.html
Reply headers:
HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Server: Max's Python Web Server/1.0
Cache-Control: max-age=600, public
Connected with 127.0.0.1:55148
Received data:
Received data:
Received data:
Received data:
The issue is in how you think about sockets: socket.recv will wait forever for more data from the client, so you don't need a loop here. However, your requests will then be limited by the recv parameter. If you want to allow requests of any size, you should detect the end of the data as the HTTP specification defines it. For example, if you are waiting for headers only, a double line feed (CRLF CRLF) marks where they end; and the size of the body (for the POST method, for example) should be given by the Content-Length header, as far as I know.
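A minimal sketch of that approach (read until the blank line that ends the headers, then read exactly Content-Length more bytes if a body is expected):

def recv_http_request(conn):
    # read headers up to the double CRLF that terminates them
    data = b''
    while b'\r\n\r\n' not in data:
        chunk = conn.recv(4096)
        if not chunk:                       # client closed the connection
            return data
        data += chunk
    headers, _, body = data.partition(b'\r\n\r\n')
    # if a Content-Length header is present, read the rest of the body
    for line in headers.split(b'\r\n')[1:]:
        name, _, value = line.partition(b':')
        if name.strip().lower() == b'content-length':
            length = int(value)
            while len(body) < length:
                chunk = conn.recv(4096)
                if not chunk:
                    break
                body += chunk
            break
    return headers + b'\r\n\r\n' + body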
Your issue is the same as in this question: link
And google for the HTTP specification if you want to build a correct HTTP server.