I've written a simple HTTP web server using Python, but I've noticed that when I connect to it, the HTML page appears in the browser window while the loading indicator in the Chrome tab continues to spin and the server keeps receiving empty strings. This continues until I click the 'X' to stop loading the page. Could someone please explain why this is happening and how to fix it? Also, if my HTTP headers are wrong or I'm missing important ones, please tell me. I found it very difficult to find information on HTTP headers and commands.
You can find the code here.
Link to image of network tab
Console output:
Socket created
Bound socket
Socket now listening
Connected with 127.0.0.1:55146
Connected with 127.0.0.1:55147
Received data: GET / HTTP/1.1
Host: localhost
Connection: keep-alive
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.154 Safari/537.36
DNT: 1
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-GB,en-US;q=0.8,en;q=0.6
Parsing GET command
Client requested directory /index.html with HTTP version 1.1
html
/index.html
Reply headers:
HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Server: Max's Python Web Server/1.0
Cache-Control: max-age=600, public
Connected with 127.0.0.1:55148
Received data:
Received data:
Received data:
Received data:
The problem is in how you think about sockets: socket.recv will wait forever for data from the client, so you don't need a loop here. However, the request size will then be limited by the recv buffer argument. If you want to allow requests of any size, you should detect the end of the data according to the HTTP specification: if you are only expecting headers, a double line break (a blank line) marks their end, and the size of the body (for a POST request, for example) is given in the Content-Length header, as far as I know.
Your issue is the same as in this question: link
Also, Google the HTTP specification if you want to build a correct HTTP server.
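For illustration, here is a minimal sketch of reading only up to the blank line that ends the headers; the function name and buffer size are my own, not taken from your code:

def recv_request_headers(conn, bufsize=4096):
    # Keep reading until the blank line (CRLF CRLF) that terminates the HTTP headers.
    data = b''
    while b'\r\n\r\n' not in data:
        chunk = conn.recv(bufsize)
        if not chunk:  # client closed the connection
            break
        data += chunk
    return data

# A request body (e.g. for POST) would then be read separately,
# using the Content-Length header to know how many bytes to expect.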
I am trying to automate an interaction with a game website by communicating with a WebSocket via Python.
In particular, I am trying to communicate with the WebSocket at: "wss://socket.colonist.io".
If I simply execute the following JS-code from the browser, I receive the incoming messages as expected:
ws = new WebSocket('wss://socket.colonist.io');
ws.onmessage = e => {
    console.log(e);
};
However, as soon as I am trying to connect to this WebSocket from outside the browser (with Node.JS or with Python), the connection gets immediately closed by the remote. An example using websocket-client in Python can be found below:
import websocket

def on_message(ws, data):
    print(f'received {data}')

websocket.enableTrace(True)
socket = websocket.WebSocketApp('wss://socket.colonist.io',
                                header={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'},
                                on_message=on_message)
socket.run_forever(origin='https://colonist.io')
socket.close()
The trace output is the following:
--- request header ---
GET / HTTP/1.1
Upgrade: websocket
Host: socket.colonist.io
Origin: https://colonist.io
Sec-WebSocket-Key: EE3U0EDp36JGZBHWUN5q4Q==
Sec-WebSocket-Version: 13
Connection: Upgrade
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36
-----------------------
--- response header ---
HTTP/1.1 101 Switching Protocols
Server: nginx/1.18.0 (Ubuntu)
Date: Sat, 24 Sep 2022 17:33:32 GMT
Connection: upgrade
Upgrade: websocket
Sec-WebSocket-Accept: EwMJ+z82BuOBOSWONpuOhjNdVCQ=
-----------------------
websocket connected
Connection to remote host was lost. - goodbye
I also tried it using Python-Autobahn and Python-websockets, both with the same negative result.
I suspect the host somehow detects that the connection is not coming from a browser (although I set a 'User-Agent' and the 'Origin') and therefore closes the connection immediately. Is there any way I can connect to this WebSocket from a script NOT running in a browser?
I am aware of the possibility of using Selenium to run and control a browser instance with Python, but I want to avoid this at all cost, due to performance reasons. (I want to control as many WebSocket connections concurrently as possible for my project).
I found the problem. Because the connection worked from a new Incognito window via the Chrome console without ever visiting the host colonist.io, and the "Application" tab of the Chrome developer tools did not show any stored cookies, I assumed no cookies were involved. After decrypting and analyzing the TLS traffic with Wireshark, I found out that a JWT is sent as a cookie on the initial GET request. After adding this cookie to my Python implementation, it worked without any problems.
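For reference, the cookie can simply be attached to the handshake headers; here is a rough sketch, assuming the cookie name and JWT value have been captured from that initial GET request (the values below are placeholders, not the real token):

import websocket

def on_message(ws, data):
    print(f'received {data}')

# placeholder: the real cookie name/value must be taken from the captured GET request
jwt_cookie = '<cookie-name>=<JWT value from the initial GET request>'

socket = websocket.WebSocketApp(
    'wss://socket.colonist.io',
    header={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
        'Cookie': jwt_cookie,  # the JWT the server expects
    },
    on_message=on_message,
)
socket.run_forever(origin='https://colonist.io')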
The remaining question now is: Where does this JWT come from if I don't see it in the "Application" tab and the only request being made is the WebSocket connection?
I have to make a basic proxy which intercepts the browser's requests and sends back a standard response. It doesn't work when I try to send the response to an HTTPS request. The code I'm using is:
#after the server socket starts listening
conn, addr = server.accept()
request = conn.recv(4096)
print request
conn.send(b"HTTP/1.1 200 OK\n\n<p>Hello</p>")
conn.close()
Now for https requests, e.g.:
Got request:
CONNECT www.google.com:443 HTTP/1.1
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:28.0) Gecko/20100101 Firefox/28.0
Proxy-Connection: keep-alive
Connection: keep-alive
Host: www.google.com
I've tried sending the same response, but the browser shows "The connection was interrupted". The response has certainly been sent, though. Am I right in thinking that to overcome this, I need to get an SSL certificate and send the response through an SSL socket?
(I'm not asking this because I'm too lazy to try it out, but setting up the certificate should take some time so I'd like to verify with someone who knows before wasting hours on a wrong hypothesis)
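For context, and not necessarily the answer to the certificate question: a plain forwarding proxy usually handles CONNECT by replying 200 and then blindly relaying bytes in both directions, without terminating TLS itself. A rough sketch, reusing conn from the snippet above and hard-coding the host/port from the captured request:

import socket
import threading

def tunnel(src, dst):
    # blindly copy bytes in one direction until either side closes
    while True:
        data = src.recv(4096)
        if not data:
            break
        dst.sendall(data)

# after reading "CONNECT www.google.com:443 HTTP/1.1" from the browser on conn:
remote = socket.create_connection(('www.google.com', 443))
conn.sendall(b"HTTP/1.1 200 Connection established\r\n\r\n")
threading.Thread(target=tunnel, args=(conn, remote)).start()
tunnel(remote, conn)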
I'm working on a project that accesses a specific site, runs a search, and then filters and returns the result. The program logs in, saves the cookie with a cookie jar to authenticate the connection, and then runs the search. However, when I run the program it returns no results, and the packet headers look completely different. What am I doing wrong that makes the search always return no results?
Here is my code:
import cookielib, urllib, urllib2
file= open('results.txt', 'wb')
cj=cookielib.CookieJar()
opener=urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders=[('Referer', 'http:// site that runs the search/psc/p01ps1/EMPLOYEE/CRM/c/BANNER_TAP.SRCH_ATDO_TAP.GBL')]
opener.addheaders=[('User-Agent', 'Mozilla/5.0 (Windows NT 5.1; rv:22.0) Gecko/20100101 Firefox/22.0')]
posts={'timezoneOffset':'180', 'userid':'user', 'pwd':'password', 'Submit':'Signon'}
data = urllib.urlencode(posts)
opens=opener.open('loginpage.com', data)
print cj
file.write(opens.read())
cjs=str(cj)
posts2 = urllib.urlencode({'ICType':'Panel', 'ICElementNum':0, 'ICStateNum':1, 'ICAction':'SRCH_ATD_TAP_WK_SRCH_PB', 'ICXPos':0, 'ICYPos':0, 'ICFocus':'', 'ICChanged':1, 'ICResubmit':0, 'ICFind':'', 'SRCH_ATD_TAP_WK_MSISDN_TAP':'', 'SRCH_ATD_TAP_WK_CNPJ_TAP':'', 'SRCH_ATD_TAP_WK_STATUS_RA_TAP':'', 'SRCH_ATD_TAP_WK_INTERACTION_ID':'', 'SRCH_ATD_TAP_WK_CASE_ID':48373914, 'SRCH_ATD_TAP_WK_PROTOCOLO_TAP':'', 'SRCH_ATD_TAP_WK_DATA_INI_BAN_TAP':'', 'SRCH_ATD_TAP_WK_HORA_INI_RA_TAP':'', 'SRCH_ATD_TAP_WK_DATA_FIM_BAN_TAP':'', 'SRCH_ATD_TAP_WK_HORA_FIM_BAN_TAP':'', 'SRCH_ATD_TAP_WK_MOTIVO_ID1_TAP':0, 'SRCH_ATD_TAP_WK_MOTIVO_ID2_TAP':0, 'SRCH_ATD_TAP_WK_MOTIVO_ID3_TAP':0, 'SRCH_ATD_TAP_WK_MOTIVO_ID4_TAP':0, 'SRCH_ATD_TAP_WK_MOTIVO_ID5_TAP':0, 'SRCH_ATD_TAP_WK_COMPANY_TYPE_TAP':'','SRCH_ATD_TAP_WK_SUBTIPO_CLI_TAP':''})
url2='searchpage.com'
opens2 = opener.open(url2, posts2)
str=opens2.read()
print cj
file.write(str + cjs)
file.close()
It connects first to the login page to save the cookie and then connects to the search page. This is only meant to be used on one site, so the connections and POST data are very specific.
Again, this code doesn't return any results (after searching the str variable, which holds the entire unfiltered page).
Here are the results I get when capturing the requests with Wireshark. The first one is the search run normally in Firefox (including the POST data sent), and the second one is my program automating the search for me.
POST /psc/p01ps1/EMPLOYEE/CRM/c/BANNER_TAP.SRCH_ATDO_TAP.GBL HTTP/1.1
Host: siteroot
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:22.0) Gecko/20100101 Firefox/22.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: pt-BR,pt;q=0.8,en-US;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate
Referer: site that runs the search/BANNER_TAP.SRCH_ATDO_TAP.GBL #note I wasn't able to create this header.
Cookie: SignOnDefault=my login id; PS_LOGINLIST=http:// siteroot; brux0128-claro-com-br-7090-PORTAL-PSJSESSIONID=dpLmTCpY8vTmj4nMHbpyptPMdvphpRLR!841308261; ExpirePage=http:// siteroot/psp/p01ps1/; PS_TOKEN=AAAAogECAwQAAQAAAAACvAAAAAAAAAAsAARTaGRyAgBOcQgAOAAuADEAMBSfJDUA/BR2T3ekF0/cVhdJ7uJlpgAAAGIABVNkYXRhVnicHYpBCoAgFESfFi2jixRqYrgO2hbWvjN0vw7X5B94bxg+8BjbtBh09v05kJlxpGq1joOd0ksnGxc3KyUS9OSJjHIQPUtlYNLqK52Ya5Li+ABuIwtr; http%3a%2f%2fsiteroot%2fpsp%2fp01ps1%2femployee%2fcrm%2frefresh=list:||||||; PS_360=PS_360_BO_ID_CUST!0!PS_360_CUST_SETID!!PS_360_BO_ID_CONT!0!PS_360_BO_ID_SITE!0!PS_360_CUST_ROLE!0!PS_360_CONT_ROLE!0!PS_360_BO_ID!0!PS_360_VIEW_OPTION!False; PS_TOKENEXPIRE=18_Feb_2014_00:04:41_GMT; HPTabName=DEFAULT
Connection: keep-alive
Content-Type: application/x-www-form-urlencoded
Content-Length: 683
POST DATA: ICType=Panel&ICElementNum=0&ICStateNum=17&ICAction=SRCH_ATD_TAP_WK_SRCH_PB&ICXPos=0&ICYPos=84&ICFocus=&ICChanged=1&ICResubmit=0&ICFind=&SRCH_ATD_TAP_WK_MSISDN_TAP=&SRCH_ATD_TAP_WK_CNPJ_TAP=&SRCH_ATD_TAP_WK_STATUS_RA_TAP=&SRCH_ATD_TAP_WK_INTERACTION_ID=&SRCH_ATD_TAP_WK_CASE_ID=48373914&SRCH_ATD_TAP_WK_PROTOCOLO_TAP=&SRCH_ATD_TAP_WK_DATA_INI_BAN_TAP=&SRCH_ATD_TAP_WK_HORA_INI_RA_TAP=&SRCH_ATD_TAP_WK_DATA_FIM_BAN_TAP=&SRCH_ATD_TAP_WK_HORA_FIM_BAN_TAP=&SRCH_ATD_TAP_WK_MOTIVO_ID1_TAP=0&SRCH_ATD_TAP_WK_MOTIVO_ID2_TAP=0&SRCH_ATD_TAP_WK_MOTIVO_ID3_TAP=0&SRCH_ATD_TAP_WK_MOTIVO_ID4_TAP=0&SRCH_ATD_TAP_WK_MOTIVO_ID5_TAP=0&SRCH_ATD_TAP_WK_COMPANY_TYPE_TAP=&SRCH_ATD_TAP_WK_SUBTIPO_CLI_TAP=
POST /psc/p01ps1/EMPLOYEE/CRM/c/BANNER_TAP.SRCH_ATDO_TAP.GBL HTTP/1.1
Accept-Encoding: identity
Content-Length: 681
Host: siteroot
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:22.0) Gecko/20100101 Firefox/22.0
Connection: close
Cookie: PS_TOKEN=AAAAogECAwQAAQAAAAACvAAAAAAAAAAsAARTaGRyAgBOcQgAOAAuADEAMBSX+ZILWKx7oU/VKvJbVT8LbueJtwAAAGIABVNkYXRhVnicJYpLCoAwDAWnVVyKF1Hsh2rXgluluvcM3s/DGWNCZh6PALexVY1Bxj4fOzKBkaSW1LCzUVrRwcrJxUKJeHlyRHqxFzomZWCQZlYm5b9Z7gVtawtT; ExpirePage=siteroot; PS_LOGINLIST=siteroot; PS_TOKENEXPIRE=18_Feb_2014_00:08:09_GMT; brux0128-claro-com-br-7090-PORTAL-PSJSESSIONID=QG14TCkJK7PpfRtNH0CSCw9S1m6jtRR9!841308261; SignOnDefault=my login id; http%3a%2f%2fsiteroot%2fpsp%2fp01ps1%2femployee%2fcrm%2frefresh=list:
Content-Type: application/x-www-form-urlencoded
POST DATA: SRCH_ATD_TAP_WK_DATA_INI_BAN_TAP=&SRCH_ATD_TAP_WK_MOTIVO_ID4_TAP=0&ICResubmit=0&ICXPos=0&SRCH_ATD_TAP_WK_DATA_FIM_BAN_TAP=&SRCH_ATD_TAP_WK_PROTOCOLO_TAP=&SRCH_ATD_TAP_WK_SUBTIPO_CLI_TAP=&SRCH_ATD_TAP_WK_MOTIVO_ID3_TAP=0&ICAction=SRCH_ATD_TAP_WK_SRCH_PB&SRCH_ATD_TAP_WK_MOTIVO_ID5_TAP=0&ICElementNum=0&SRCH_ATD_TAP_WK_INTERACTION_ID=&ICType=Panel&SRCH_ATD_TAP_WK_STATUS_RA_TAP=&SRCH_ATD_TAP_WK_COMPANY_TYPE_TAP=&SRCH_ATD_TAP_WK_HORA_FIM_BAN_TAP=&SRCH_ATD_TAP_WK_MOTIVO_ID2_TAP=0&ICFind=&SRCH_ATD_TAP_WK_MOTIVO_ID1_TAP=0&SRCH_ATD_TAP_WK_HORA_INI_RA_TAP=&ICChanged=1&ICStateNum=1&ICYPos=0&ICFocus=&SRCH_ATD_TAP_WK_CASE_ID=48373914&SRCH_ATD_TAP_WK_MSISDN_TAP=&SRCH_ATD_TAP_WK_CNPJ_TAP=
(This is for personal use at the company I work at, to simplify a task that currently needs to be done manually around 500 times. It is a site that registers protocols, and we need to search the protocols to check whether each one is closed or not; later I will import a list from Excel.)
Note that I don't send the additional headers, but if that could solve the problem I can add them. For some reason my POST data ends up in a different order (although from what I understand about POST data that shouldn't make a difference), and the cookie information is also somewhat reversed, but I assume that shouldn't matter either, since the cookie data is handled much like a Python dictionary.
I've been breaking my head over this little piece of code, rewriting it several times over the past two weeks, and I still can't get it to return the search results.
It's also important to note that I won't be able to install a browser engine to execute the JavaScript, but I don't think that's necessary, because the results of the search done in Firefox show up in Wireshark, so the page is downloaded together with the result. I was able to get mechanize running, but I haven't been able to try it yet. If there is a way to automate Firefox (I don't remember which version at the moment) with Python, that is an option I'm open to.
One more thing: because I'm working on this project at work, I'm not able to use any Python package that has to be installed. I got mechanize to work because I opened and copied the files over without running setup.py. So, to keep things simple, assume I have no way to install libraries.
You don't have PS_360 set in your cookies. I'm not sure how essential this is, but the best strategy for working through these issues is to make your requests identical step by step. Probably the first request that sets your cookies was already different, or your browser has cookie data from previous requests that you need to recreate manually for your request.
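As a concrete starting point for making the requests identical: assigning opener.addheaders twice keeps only the second list, so the Referer is dropped. A sketch of setting both headers at once and printing the cookies the jar actually holds (the URLs are placeholders, as in the question):

import cookielib, urllib2

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

# one list, so the Referer is not overwritten by a second assignment
opener.addheaders = [
    ('Referer', 'http://siteroot/psc/p01ps1/EMPLOYEE/CRM/c/BANNER_TAP.SRCH_ATDO_TAP.GBL'),
    ('User-Agent', 'Mozilla/5.0 (Windows NT 5.1; rv:22.0) Gecko/20100101 Firefox/22.0'),
]

response = opener.open('http://loginpage.com')  # placeholder URL, as in the question

# dump the cookies the jar actually holds, to compare against the Wireshark capture
for cookie in cj:
    print cookie.name, '=', cookie.value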
I've written a pretty basic polling proxy web server using Python's socket module. For the proxy I've written a simple readline() using socket's recv() function.
It goes something like this:
def readline(socket):
    buffer = ''
    char = socket.recv(1)
    while char != '\n' and char != '':
        buffer += char
        char = socket.recv(1)
    if char == '':
        buffer = ''
    else:
        buffer += '\n'
    return buffer
From my understanding, if recv() returns an empty string that means there was either a socket error or one side has closed their connection, so when that happens I return an empty string to my proxy to let it know the readline() has failed.
When running the proxy, I am able to access sites like youtube.com and yahoo.com, but whenever I try to access www.google.com my readline function always returns an empty string on the very first readline (to read the request line in the HTTP request).
Any ideas?
EDIT:
Sorry I guess I was unclear. I am waiting for the request sent by my Mozilla Firefox client to my proxy server when 'http://www.google.com/' is typed into the address bar and that is where I am hanging. I'm not even getting to the part where I forward the request to the remote server and send back the response.
I think Google might be waiting for your request headers in the first place; if it can't authorize you, it will close the connection. Also, you are not reading requests, you are sending requests; what you read are responses.
But it might be something other than missing headers.
-- UPDATE --
Try sending these headers right after connecting:
GET / HTTP/1.1
Host: google.com
Connection: keep-alive
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17
X-Chrome-Variations: CM61yQEIk7bJAQiatskBCKa2yQEIp7bJAQiptskBCLi2yQEI34PKAQ==
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US;q=0.6,en;q=0.4
Accept-Charset: utf-8;q=0.7,*;q=0.3
You can also use Firebug to check what headers your browser sends to Google and what response you get.
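For example, a minimal sketch of sending such a request over a raw socket, just to see what Google replies with (headers trimmed to the essentials; this is an experiment, not the proxy code itself):

import socket

request = (
    b"GET / HTTP/1.1\r\n"
    b"Host: google.com\r\n"
    b"Connection: keep-alive\r\n"
    b"Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n"
    b"User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17\r\n"
    b"\r\n"
)

sock = socket.create_connection(('google.com', 80))
sock.sendall(request)
print(sock.recv(4096))  # first chunk of the response (status line and headers)
sock.close()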
Some Code
import urllib.request
from urllib import error

headers = {}
headers['user-agent'] = 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0'
headers['Accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
headers['Accept-Language'] = 'en-gb,en;q=0.5'
#headers['Accept-Encoding'] = 'gzip, deflate'

request = urllib.request.Request(sURL, headers=headers)
try:
    response = urllib.request.urlopen(request)
except error.HTTPError as e:
    print('The server couldn\'t fulfill the request.')
    print('Error code: {0}'.format(e.code))
except error.URLError as e:
    print('We failed to reach a server.')
    print('Reason: {0}'.format(e.reason))
else:
    f = open('output/{0}.html'.format(sFileName), 'w')
    f.write(response.read().decode('utf-8'))
A url
http://groupon.cl/descuentos/santiago-centro
The situation
Here's what I did:
enable javascript in browser
open url above and keep an eye on the console
disable javascript
repeat step 2 (for those of you who have just tuned in, javascript has now been disabled)
use urllib2 to grab the webpage and save it to a file
enable javascript
open the file with browser and observe console
repeat 7 with javascript off
results
In step 2 I saw that a whole lot of the page content was loaded dynamically using ajax, so the HTML that arrived was a sort of skeleton and ajax was used to fill in the gaps. This is fine and not at all surprising.
Since the page should be SEO-friendly, it should work fine without JS. In step 4 nothing happens in the console, and the skeleton page loads pre-populated, rendering the ajax unnecessary. This is also completely not confusing.
In step 7 the ajax calls are made but fail. This is also OK, since the URLs they use are not local, so the calls are broken. The page looks like the skeleton. This is also great and expected.
In step 8 no ajax calls are made and the skeleton is just a skeleton. I would have thought that this should behave very much like step 4.
question
What I want to do is use urllib2 to grab the HTML from step 4, but I can't figure out how.
What am I missing and how could I pull this off?
To paraphrase
If I were writing a spider, I would want to be able to grab plain ol' HTML (as in what resulted in step 4). I don't want to execute ajax stuff or any JavaScript at all. I don't want to populate anything dynamically. I just want HTML.
The seo friendly site wants me to get what I want because that's what seo is all about.
How would one go about getting plain HTML content given the situation I outlined?
To do it manually I would turn off js, navigate to the page, view source, ctrl-a, ctrl-c, ctrl-v(somewhere useful).
To get a script to do it for me I would...?
stuff I've tried
I used Wireshark to look at the packet headers, and the GETs sent from my PC in steps 2 and 4 have the same headers. Reading about SEO makes me think this is pretty normal; otherwise techniques such as Hijax wouldn't be used.
Here are the headers my browser sends:
Host: groupon.cl
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Here are the headers my script sends:
Accept-Encoding: identity
Host: groupon.cl
Accept-Language: en-gb,en;q=0.5
Connection: close
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
User-Agent: User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0
The differences are:
my script has Connection = close instead of keep-alive. I can't see how this would cause a problem
my script has Accept-Encoding = identity. This might be the cause of the problem. I can't really see why the host would use this field to determine the user agent, though. If I change the encoding to match the browser request headers, then I have trouble decoding the response. I'm working on this now; see the sketch below.
watch this space, I'll update the question as new info comes up
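In case the Accept-Encoding route is the one to pursue, here is a minimal sketch of requesting gzip and decompressing the body before decoding, assuming the server actually returns gzip (this is an illustration, not the code above):

import gzip
import io
import urllib.request

req = urllib.request.Request(
    'http://groupon.cl/descuentos/santiago-centro',
    headers={
        'Accept-Encoding': 'gzip, deflate',
        'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0',
    },
)
response = urllib.request.urlopen(req)
body = response.read()

# only decompress if the server actually compressed the response
if response.headers.get('Content-Encoding') == 'gzip':
    body = gzip.GzipFile(fileobj=io.BytesIO(body)).read()

print(body.decode('utf-8'))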