Python recv failing for certain websites

I've written a pretty basic polling proxy web server using Python's socket module. For the proxy I've written a simple readline() using socket's recv() function.
It goes something like this:
def readline(socket):
    # read one byte at a time until a newline or until the peer closes
    buffer = ''
    char = socket.recv(1)
    while char != '\n' and char != '':
        buffer += char
        char = socket.recv(1)
    if char == '':
        # recv returned '': socket error or connection closed, signal failure
        buffer = ''
    else:
        buffer += '\n'
    return buffer
From my understanding, if recv() returns an empty string, there was either a socket error or one side has closed the connection, so when that happens I return an empty string to my proxy to let it know the readline() has failed.
When running the proxy I am able to access sites like youtube.com and yahoo.com, but whenever I try to access www.google.com, my readline function always returns an empty string on the very first readline (the one that should read the request line of the HTTP request).
Any ideas?
EDIT:
Sorry, I guess I was unclear. I am waiting for the request sent by my Mozilla Firefox client to my proxy server when 'http://www.google.com/' is typed into the address bar, and that is where I am hanging. I'm not even getting to the part where I forward the request to the remote server and send back the response.

I think Google might be waiting for your request headers in the first place; if it doesn't authorize you, it will close the connection. Also note: you are not reading requests, you are sending requests. What you read are responses.
But it might be something other than missing headers.
-- UPDATE --
Try to send these headers just after connection.
GET / HTTP/1.1
Host: google.com
Connection: keep-alive
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17
X-Chrome-Variations: CM61yQEIk7bJAQiatskBCKa2yQEIp7bJAQiptskBCLi2yQEI34PKAQ==
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US;q=0.6,en;q=0.4
Accept-Charset: utf-8;q=0.7,*;q=0.3
You can also check what headers your browser sends to Google and what response you get back using Firebug.
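For illustration, here is a rough sketch (untested against google.com; it reuses the readline() from the question and assumes plain HTTP on port 80) of sending a minimal version of those headers over a raw socket:

import socket

sock = socket.create_connection(('www.google.com', 80))
request = ('GET / HTTP/1.1\r\n'
           'Host: google.com\r\n'
           'Connection: keep-alive\r\n'
           'User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 '
           '(KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17\r\n'
           '\r\n')
sock.sendall(request)     # Python 2 str; on Python 3 use request.encode('ascii')
print readline(sock)      # should now return the status line of the response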

Related

Python socket server can't decode redirect from OAUTH

I'm not sure how much of the code I can show, but the concept is simple. I am writing a python script that works with the TD Ameritrade API. I am getting a url for the portal from the API, and opening it in the browser. Next, I'm setting up a socket server to handle the redirect of the portal. Below is the code for the server:
import socket

serversocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# get local machine name
host = "localhost"
port = 10120
serversocket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
# bind to the port
serversocket.bind((host, port))
print("listening")
# queue up to 5 requests
serversocket.listen(5)
allData = ""
while True:
    # establish a connection
    try:
        conn, addr = serversocket.accept()
    except KeyboardInterrupt:
        break
    print("Got a connection from %s" % str(addr))
    while True:
        data = conn.recv(4096)
        if not data:
            print("done")
            break
        print(data.decode('utf-8', "ignore"))
    conn.close()
When I go through the portal and get redirected, in the console I see the following:
Got a connection from ('127.0.0.1', 65505)
|,?2!c[N': [?`XAn] "::+/,0̨̩ / 5
jj localhost
3 + )http/1.1
ej\E<zpִ_<%q\r)+ - +
jj zz
However, if I were to copy the URL, open a new tab, paste it and go, I get the following (correct) response:
Got a connection from ('127.0.0.1', 49174)
GET /?code=<RESPONSE_TOKEN> HTTP/1.1
Host: localhost:10120
Connection: keep-alive
Cache-Control: max-age=0
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36 OPR/68.0.3618.125
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Sec-Fetch-Site: none
Sec-Fetch-Mode: navigate
Sec-Fetch-User: ?1
Sec-Fetch-Dest: document
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
When I go to the network overview of my browser, I see the following warning when trying to view the request headers: Provisional headers are shown
And the only difference between the HTTP request from the redirect and the request when I manually paste the URL in is the initiator column in the network viewer: it shows "oath" for the redirect and "other" when manually pasted in.
I hope I've provided enough information and code. I can try to make a copy for reproducing if needed, but a TD Ameritrade Developer account would be needed to connect with the API.
Thanks in advance for any help. I've been researching for over 6 hours and wasn't able to find anything. Hopefully I didn't miss something obvious.
I think a raw socket is not required to handle an OAuth redirect; sockets are meant for other kinds of requirements. Also, when you manually hit the redirect URL, no special socket handling is involved either: it is just a simple HTTP endpoint.
Try this snippet, which extracts the OAuth code:
from urlparse import urlparse, parse_qsl            # Python 3: urllib.parse
from BaseHTTPServer import BaseHTTPRequestHandler   # Python 3: http.server

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        url = urlparse(self.path)
        # parse_qsl returns a list of (name, value) pairs, so build a dict first
        code = dict(parse_qsl(url.query))['code']
Or this:
https://gist.github.com/willnix/daed2b57ab8d613f6bfa53c6d0b46fd3
You can get more snippets of simple http get endpoints here:
https://gist.github.com/search?q=def+do_GET+python&ref=searchresults
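To flesh that out, here is a minimal runnable sketch (using the Python 2 module names from the snippet above; the port and the 'code' query parameter are taken from the question):

from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer
from urlparse import urlparse, parse_qsl

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # stash the ?code=... value from the redirect on the server object
        params = dict(parse_qsl(urlparse(self.path).query))
        self.server.oauth_code = params.get('code')
        self.send_response(200)
        self.end_headers()
        self.wfile.write('You can close this tab now.')

server = HTTPServer(('localhost', 10120), Handler)
server.handle_request()   # serve exactly one request: the redirect
print server.oauth_code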

Do I have to get an SSL certificate to send my own browser responses to HTTPS requests?

I have to make a basic proxy which intercepts the browser's requests and sends back a standard response. It doesn't work when I try to send the response in reply to an HTTPS request. The code I'm using is:
#after the server socket starts listening
conn, addr = server.accept()
request = conn.recv(4096)
print request
conn.send(b"HTTP/1.1 200 OK\n\n<p>Hello</p>")
conn.close()
Now for https requests, e.g.:
Got request:
CONNECT www.google.com:443 HTTP/1.1
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:28.0) Gecko/20100101 Firefox/28.0
Proxy-Connection: keep-alive
Connection: keep-alive
Host: www.google.com
I've tried sending the same response, but the browser shows "The connection was interrupted". The response has certainly been sent, though. Am I right in thinking that to overcome this, I need to get an SSL certificate and send the response through an SSL socket?
(I'm not asking this because I'm too lazy to try it out, but setting up the certificate should take some time so I'd like to verify with someone who knows before wasting hours on a wrong hypothesis)
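For what it's worth, the HTTP specification expects a proxy to answer a CONNECT with a 2xx status and then blindly relay bytes in both directions; TLS is negotiated end to end between the browser and the remote server, so a plain pass-through tunnel needs no certificate. A certificate only becomes necessary if the proxy itself is supposed to decrypt the traffic (man-in-the-middle). A rough sketch under those assumptions, extending the snippet above:

if request.startswith('CONNECT'):
    # acknowledge the tunnel; the browser then runs the TLS handshake
    # with the remote server, and the proxy never sees the plaintext
    conn.send('HTTP/1.1 200 Connection Established\r\n\r\n')
    # ...from here, open a socket to the host:port named on the CONNECT
    # line and relay raw bytes both ways until either side closes
else:
    conn.send(b"HTTP/1.1 200 OK\n\n<p>Hello</p>")
    conn.close()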

Page infinitely loading when connecting to my python web server

I've written a simple HTTP web server using Python, but I've noticed that when I connect to it, the HTML page appears in the browser window while the indicator in the Chrome tab continues to spin and the server keeps receiving empty strings. This continues until I click the 'X' to stop loading the page. Could someone please explain why this is happening and how to fix it? Also, if my HTTP headers are wrong or I'm missing important ones, please tell me; I found it very difficult to find information on HTTP headers and commands.
You can find the code here.
Link to image of network tab
Console output:
Socket created
Bound socket
Socket now listening
Connected with 127.0.0.1:55146
Connected with 127.0.0.1:55147Received data: GET / HTTP/1.1
Host: localhost
Connection: keep-alive
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.154 Safari/537.36
DNT: 1
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-GB,en-US;q=0.8,en;q=0.6
Parsing GET command
Client requested directory /index.html with HTTP version 1.1
html
/index.html
Reply headers:
HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Server: Max's Python Web Server/1.0
Cache-Control: max-age=600, public
Connected with 127.0.0.1:55148
Received data:
Received data:
Received data:
Received data:
The problem is how you think about sockets: socket.recv will wait forever for more data from the client, so you don't need a loop here.
However, a single recv limits the request to the size given by its parameter. If you want to allow requests of any size, you should detect the end of the data the way the HTTP specification describes: if you are only expecting headers, a blank line (double linefeed) marks where they end, and the size of the body (for a POST, for example) is given by the Content-Length header, as far as I know.
Your issue is the same as in this question: link
And Google the HTTP specification if you want to build a correct HTTP server.
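As a sketch of that idea (not the answerer's code; it assumes a blocking socket and a well-formed request):

def recv_http_request(conn):
    data = b''
    while b'\r\n\r\n' not in data:          # headers end at a blank line
        chunk = conn.recv(4096)
        if not chunk:                       # client closed the connection
            return data
        data += chunk
    head, _, body = data.partition(b'\r\n\r\n')
    length = 0                              # no body unless Content-Length says so
    for line in head.split(b'\r\n')[1:]:    # skip the request line
        name, _, value = line.partition(b':')
        if name.strip().lower() == b'content-length':
            length = int(value.strip())
    while len(body) < length:               # read the rest of the body, if any
        chunk = conn.recv(4096)
        if not chunk:
            break
        body += chunk
    return head + b'\r\n\r\n' + body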

urllib+cookielib packet manipulation in python

I'm working on a project that accesses a specific site, runs a search, and then filters and returns the value. The program logs in and saves the cookie with a cookie jar to authenticate the connection while it runs the search. However, when I run the program it returns no results, and the packet header looks completely different. What am I doing wrong that the search always returns no results?
Here is my code:
import cookielib, urllib, urllib2

file = open('results.txt', 'wb')
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
# both headers must go into one list: assigning opener.addheaders twice
# would overwrite the Referer with the User-Agent
opener.addheaders = [('Referer', 'http:// site that runs the search/psc/p01ps1/EMPLOYEE/CRM/c/BANNER_TAP.SRCH_ATDO_TAP.GBL'),
                     ('User-Agent', 'Mozilla/5.0 (Windows NT 5.1; rv:22.0) Gecko/20100101 Firefox/22.0')]
posts={'timezoneOffset':'180', 'userid':'user', 'pwd':'password', 'Submit':'Signon'}
data = urllib.urlencode(posts)
opens=opener.open('loginpage.com', data)
print cj
file.write(opens.read())
cjs=str(cj)
posts2 = urllib.urlencode({'ICType':'Panel', 'ICElementNum':0, 'ICStateNum':1, 'ICAction':'SRCH_ATD_TAP_WK_SRCH_PB', 'ICXPos':0, 'ICYPos':0, 'ICFocus':'', 'ICChanged':1, 'ICResubmit':0, 'ICFind':'', 'SRCH_ATD_TAP_WK_MSISDN_TAP':'', 'SRCH_ATD_TAP_WK_CNPJ_TAP':'', 'SRCH_ATD_TAP_WK_STATUS_RA_TAP':'', 'SRCH_ATD_TAP_WK_INTERACTION_ID':'', 'SRCH_ATD_TAP_WK_CASE_ID':48373914, 'SRCH_ATD_TAP_WK_PROTOCOLO_TAP':'', 'SRCH_ATD_TAP_WK_DATA_INI_BAN_TAP':'', 'SRCH_ATD_TAP_WK_HORA_INI_RA_TAP':'', 'SRCH_ATD_TAP_WK_DATA_FIM_BAN_TAP':'', 'SRCH_ATD_TAP_WK_HORA_FIM_BAN_TAP':'', 'SRCH_ATD_TAP_WK_MOTIVO_ID1_TAP':0, 'SRCH_ATD_TAP_WK_MOTIVO_ID2_TAP':0, 'SRCH_ATD_TAP_WK_MOTIVO_ID3_TAP':0, 'SRCH_ATD_TAP_WK_MOTIVO_ID4_TAP':0, 'SRCH_ATD_TAP_WK_MOTIVO_ID5_TAP':0, 'SRCH_ATD_TAP_WK_COMPANY_TYPE_TAP':'','SRCH_ATD_TAP_WK_SUBTIPO_CLI_TAP':''})
url2='searchpage.com'
opens2 = opener.open(url2, posts2)
body = opens2.read()  # don't shadow the built-in str
print cj
file.write(body + cjs)
file.close()
It connects the first time to the login page to save the cookie and then connects to the search page. Again, this is only meant to be used on one site, so the connections and post data are very specific.
Again, this code doesn't return any results (after searching the body variable, which holds the entire unfiltered page).
Here are the results I get when scanning the requests with Wireshark: the first one is the search run from Firefox in a normal browser (including the post data sent), and the second one is my program running and automating the search for me.
POST /psc/p01ps1/EMPLOYEE/CRM/c/BANNER_TAP.SRCH_ATDO_TAP.GBL HTTP/1.1
Host: siteroot
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:22.0) Gecko/20100101 Firefox/22.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: pt-BR,pt;q=0.8,en-US;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate
Referer: site that runs the search/BANNER_TAP.SRCH_ATDO_TAP.GBL #note I wasn't able to create this header.
Cookie: SignOnDefault=my login id; PS_LOGINLIST=http:// siteroot; brux0128-claro-com-br-7090-PORTAL-PSJSESSIONID=dpLmTCpY8vTmj4nMHbpyptPMdvphpRLR!841308261; ExpirePage=http:// siteroot/psp/p01ps1/; PS_TOKEN=AAAAogECAwQAAQAAAAACvAAAAAAAAAAsAARTaGRyAgBOcQgAOAAuADEAMBSfJDUA/BR2T3ekF0/cVhdJ7uJlpgAAAGIABVNkYXRhVnicHYpBCoAgFESfFi2jixRqYrgO2hbWvjN0vw7X5B94bxg+8BjbtBh09v05kJlxpGq1joOd0ksnGxc3KyUS9OSJjHIQPUtlYNLqK52Ya5Li+ABuIwtr; http%3a%2f%2fsiteroot%2fpsp%2fp01ps1%2femployee%2fcrm%2frefresh=list:||||||; PS_360=PS_360_BO_ID_CUST!0!PS_360_CUST_SETID!!PS_360_BO_ID_CONT!0!PS_360_BO_ID_SITE!0!PS_360_CUST_ROLE!0!PS_360_CONT_ROLE!0!PS_360_BO_ID!0!PS_360_VIEW_OPTION!False; PS_TOKENEXPIRE=18_Feb_2014_00:04:41_GMT; HPTabName=DEFAULT
Connection: keep-alive
Content-Type: application/x-www-form-urlencoded
Content-Length: 683
POST DATA: ICType=Panel&ICElementNum=0&ICStateNum=17&ICAction=SRCH_ATD_TAP_WK_SRCH_PB&ICXPos=0&ICYPos=84&ICFocus=&ICChanged=1&ICResubmit=0&ICFind=&SRCH_ATD_TAP_WK_MSISDN_TAP=&SRCH_ATD_TAP_WK_CNPJ_TAP=&SRCH_ATD_TAP_WK_STATUS_RA_TAP=&SRCH_ATD_TAP_WK_INTERACTION_ID=&SRCH_ATD_TAP_WK_CASE_ID=48373914&SRCH_ATD_TAP_WK_PROTOCOLO_TAP=&SRCH_ATD_TAP_WK_DATA_INI_BAN_TAP=&SRCH_ATD_TAP_WK_HORA_INI_RA_TAP=&SRCH_ATD_TAP_WK_DATA_FIM_BAN_TAP=&SRCH_ATD_TAP_WK_HORA_FIM_BAN_TAP=&SRCH_ATD_TAP_WK_MOTIVO_ID1_TAP=0&SRCH_ATD_TAP_WK_MOTIVO_ID2_TAP=0&SRCH_ATD_TAP_WK_MOTIVO_ID3_TAP=0&SRCH_ATD_TAP_WK_MOTIVO_ID4_TAP=0&SRCH_ATD_TAP_WK_MOTIVO_ID5_TAP=0&SRCH_ATD_TAP_WK_COMPANY_TYPE_TAP=&SRCH_ATD_TAP_WK_SUBTIPO_CLI_TAP=
POST /psc/p01ps1/EMPLOYEE/CRM/c/BANNER_TAP.SRCH_ATDO_TAP.GBL HTTP/1.1
Accept-Encoding: identity
Content-Length: 681
Host: siteroot
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:22.0) Gecko/20100101 Firefox/22.0
Connection: close
Cookie: PS_TOKEN=AAAAogECAwQAAQAAAAACvAAAAAAAAAAsAARTaGRyAgBOcQgAOAAuADEAMBSX+ZILWKx7oU/VKvJbVT8LbueJtwAAAGIABVNkYXRhVnicJYpLCoAwDAWnVVyKF1Hsh2rXgluluvcM3s/DGWNCZh6PALexVY1Bxj4fOzKBkaSW1LCzUVrRwcrJxUKJeHlyRHqxFzomZWCQZlYm5b9Z7gVtawtT; ExpirePage=siteroot; PS_LOGINLIST=siteroot; PS_TOKENEXPIRE=18_Feb_2014_00:08:09_GMT; brux0128-claro-com-br-7090-PORTAL-PSJSESSIONID=QG14TCkJK7PpfRtNH0CSCw9S1m6jtRR9!841308261; SignOnDefault=my login id; http%3a%2f%2fsiteroot%2fpsp%2fp01ps1%2femployee%2fcrm%2frefresh=list:
Content-Type: application/x-www-form-urlencoded
POST DATA: SRCH_ATD_TAP_WK_DATA_INI_BAN_TAP=&SRCH_ATD_TAP_WK_MOTIVO_ID4_TAP=0&ICResubmit=0&ICXPos=0&SRCH_ATD_TAP_WK_DATA_FIM_BAN_TAP=&SRCH_ATD_TAP_WK_PROTOCOLO_TAP=&SRCH_ATD_TAP_WK_SUBTIPO_CLI_TAP=&SRCH_ATD_TAP_WK_MOTIVO_ID3_TAP=0&ICAction=SRCH_ATD_TAP_WK_SRCH_PB&SRCH_ATD_TAP_WK_MOTIVO_ID5_TAP=0&ICElementNum=0&SRCH_ATD_TAP_WK_INTERACTION_ID=&ICType=Panel&SRCH_ATD_TAP_WK_STATUS_RA_TAP=&SRCH_ATD_TAP_WK_COMPANY_TYPE_TAP=&SRCH_ATD_TAP_WK_HORA_FIM_BAN_TAP=&SRCH_ATD_TAP_WK_MOTIVO_ID2_TAP=0&ICFind=&SRCH_ATD_TAP_WK_MOTIVO_ID1_TAP=0&SRCH_ATD_TAP_WK_HORA_INI_RA_TAP=&ICChanged=1&ICStateNum=1&ICYPos=0&ICFocus=&SRCH_ATD_TAP_WK_CASE_ID=48373914&SRCH_ATD_TAP_WK_MSISDN_TAP=&SRCH_ATD_TAP_WK_CNPJ_TAP=
(This is for personal use at the company I work at, to simplify a task that currently has to be done around 500 times manually. It is a site that registers protocols, and we need to search the protocols to check whether each one is closed or not; later the list will be imported from Excel.)
Note that I don't send the additional headers, but if that could solve the problem I can add them. For some reason my post data comes out in a different order, but from what I understand about post data, that shouldn't make a difference. The cookie information is also somewhat out of order, but I would assume that shouldn't matter either, because the cookie info is handled much like a Python dictionary.
I've been breaking my head over this little piece of code for the past two weeks, rewriting it several times, and I still can't get it to return the search results.
It's also important to note that I won't be able to install a browser core to execute the JavaScript, but I also don't think that's necessary, because the results of the search done in Firefox show up in Wireshark, so the site is downloaded with the result. I was able to get mechanize running, but I haven't been able to try it yet. If there is a way to automate Firefox (I don't remember which version at the moment) with Python, that is an option I'm open to.
One more thing: because I'm working on this project at work, I'm not able to use any Python plugin that has to be installed. I got mechanize to work because I opened and copied the file over without running its setup.py. So, just to make things easier, assume I have no way to install libraries.
You don't have PS_360 set in your cookie. I'm not sure how essential that is, but the best strategy for working through these issues is to make the requests identical step by step. Probably the first request that set your cookie was already different, or your browser has cookie data from previous requests that you need to recreate manually for your request.
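One way to work toward identical requests (a sketch; urllib2's HTTPHandler takes a debuglevel flag that prints each request and response to stdout):

import cookielib, urllib2

cj = cookielib.CookieJar()
opener = urllib2.build_opener(
    urllib2.HTTPHandler(debuglevel=1),    # print what is actually sent on the wire
    urllib2.HTTPCookieProcessor(cj))
# every opener.open(...) call now logs its headers, which can be diffed
# line by line against the working browser request captured in Wireshark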

urllib2 misbehaving with dynamically loaded content

Some Code
import urllib.request
from urllib import error

headers = {}
# note the doubled prefix: this sends the literal value
# 'User-Agent: Mozilla/...', which is visible in the sent headers below
headers['user-agent'] = 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0'
headers['Accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
headers['Accept-Language'] = 'en-gb,en;q=0.5'
#headers['Accept-Encoding'] = 'gzip, deflate'
request = urllib.request.Request(sURL, headers=headers)
try:
    response = urllib.request.urlopen(request)
except error.HTTPError as e:
    print('The server couldn\'t fulfill the request.')
    print('Error code: {0}'.format(e.code))
except error.URLError as e:
    print('We failed to reach a server.')
    print('Reason: {0}'.format(e.reason))
else:
    f = open('output/{0}.html'.format(sFileName), 'w')
    f.write(response.read().decode('utf-8'))
A url
http://groupon.cl/descuentos/santiago-centro
The situation
Here's what I did:
1. enable javascript in the browser
2. open the url above and keep an eye on the console
3. disable javascript
4. repeat step 2 (for those of you who have just tuned in, javascript has now been disabled)
5. use urllib2 to grab the webpage and save it to a file
6. enable javascript
7. open the file with the browser and observe the console
8. repeat step 7 with javascript off
results
In step 2 I saw that a whole lot of the page content was loaded dynamically using ajax, so the HTML that arrived was a sort of skeleton and ajax was used to fill in the gaps. This is fine and not at all surprising.
Since the page should be SEO-friendly it should work fine without JS. In step 4 nothing happens in the console and the skeleton page loads pre-populated, rendering the ajax unnecessary. This is also completely not confusing.
In step 7 the ajax calls are made but fail. This is also OK, since the URLs they use are not local, so the calls are broken. The page looks like the skeleton. This is also great and expected.
In step 8 no ajax calls are made and the skeleton is just a skeleton. I would have thought that this should behave very much like step 4.
question
What I want to do is use urllib2 to grab the HTML from step 4, but I can't figure out how.
What am I missing and how could I pull this off?
To paraphrase
If I was writing a spider I would want to be able to grab plain ol' HTML (as in what resulted in step 4). I don't want to execute ajax stuff or any javascript at all. I don't want to populate anything dynamically. I just want HTML.
The SEO-friendly site wants me to get what I want, because that's what SEO is all about.
How would one go about getting plain HTML content given the situation I outlined?
To do it manually I would turn off JS, navigate to the page, view source, ctrl-a, ctrl-c, ctrl-v (somewhere useful).
To get a script to do it for me I would...?
stuff I've tried
I used Wireshark to look at packet headers, and the GETs sent off from my PC in steps 2 and 4 have the same headers. Reading about SEO makes me think this is pretty normal; otherwise techniques such as hijax wouldn't be used.
Here are the headers my browser sends:
Host: groupon.cl
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Here are the headers my script sends:
Accept-Encoding: identity
Host: groupon.cl
Accept-Language: en-gb,en;q=0.5
Connection: close
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
User-Agent: User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0
The differences are:
my script has Connection = close instead of keep-alive. I can't see how this would cause a problem
my script has Accept-encoding = identity. This might be the cause of the problem. I can't really see why the host would use this field to determine the user-agent though. If I change encoding to match the browser request headers then I have trouble decoding it. I'm working on this now...
watch this space, I'll update the question as new info comes up
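In the meantime, here is a sketch of what handling a gzipped response with the standard library might look like (assuming the same sURL; it requests only gzip so a single decompression path suffices):

import gzip, io
import urllib.request

request = urllib.request.Request(sURL, headers={
    'Accept-Encoding': 'gzip',   # ask only for gzip so one code path is enough
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0',
})
response = urllib.request.urlopen(request)
body = response.read()
if response.headers.get('Content-Encoding') == 'gzip':
    # decompress before decoding, mirroring what the browser does
    body = gzip.GzipFile(fileobj=io.BytesIO(body)).read()
html = body.decode('utf-8')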
