Python: Extract JSON from HTTP Response

Say I have the following HTTP request:
GET /4 HTTP/1.1
Host: graph.facebook.com
And the server returns the following response:
HTTP/1.1 200 OK
Access-Control-Allow-Origin: *
Cache-Control: private, no-cache, no-store, must-revalidate
Content-Type: text/javascript; charset=UTF-8
ETag: "539feb8aee5c3d20a2ebacd02db380b27243b255"
Expires: Sat, 01 Jan 2000 00:00:00 GMT
Pragma: no-cache
X-FB-Rev: 1070755
X-FB-Debug: pC4b0ONpdhLwBn6jcabovcZf44bkfKSEguNsVKuSI1I=
Date: Wed, 08 Jan 2014 01:22:36 GMT
Connection: keep-alive
Content-Length: 172
{"id":"4","name":"Mark Zuckerberg","first_name":"Mark","last_name":"Zuckerberg","link":"http:\/\/www.facebook.com\/zuck","username":"zuck","gender":"male","locale":"en_US"}
Since the Content-Length header depends on the length of the content, I cannot simply split on a fixed Content-Length: 172 string. How can I extract the JSON and the headers separately? Both are important to my program.
I am using this code to get the response:
import socket
import json

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("graph.facebook.com", 80))
s.send("GET /" + str(id) + "/picture HTTP/1.1\r\nHost: graph.facebook.com\r\n\r\n")
data = s.recv(1024)
s.close()
json_string = (somehow extract this)
userdata = json.loads(json_string)

The easy way to do this is to use an HTTP library. For example:
import json
import urllib2
r = urllib2.urlopen("http://graph.facebook.com/{}/picture".format(id))
json_string = r.read()
userdata = json.loads(json_string)
If you really want to parse it yourself, the HTTP protocol guarantees that headers and body are separated by an empty line, and that this will be the first empty line anywhere in the response, so it's not that hard:
data = s.recv(1024)
header, _, json_string = data.partition('\r\n\r\n')
userdata = json.loads(json_string)
There are some obvious downsides to this. As written, your code won't work if the response is longer than 1K, if the kernel doesn't hand you the whole response in a single recv (which it's never guaranteed to do), if the server redirects you or sends a 100 Continue before the real response, or if the server decides to send back a chunked or MIME-multipart or other response instead of a flat body, or…
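If you do stick with raw sockets, here is a minimal sketch of a more robust read loop. It assumes a flat, non-chunked body with a Content-Length header (no redirects, no 100 Continue), and fetch_json is just an illustrative helper, not a library function:
import socket
import json

def fetch_json(host, path):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect((host, 80))
    s.sendall("GET %s HTTP/1.1\r\nHost: %s\r\nConnection: close\r\n\r\n" % (path, host))
    data = ''
    while '\r\n\r\n' not in data:       # keep reading until the headers are complete
        chunk = s.recv(1024)
        if not chunk:
            break
        data += chunk
    header, _, body = data.partition('\r\n\r\n')
    # parse Content-Length so we know how many body bytes to expect
    length = 0
    for line in header.split('\r\n')[1:]:
        name, _, value = line.partition(':')
        if name.strip().lower() == 'content-length':
            length = int(value.strip())
    while len(body) < length:           # keep reading until the body is complete
        chunk = s.recv(1024)
        if not chunk:
            break
        body += chunk
    s.close()
    return header, json.loads(body)

header, userdata = fetch_json("graph.facebook.com", "/4")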

Related

Can someone explain get requests, specifically the http header?

I'm new to REST APIs and I'm trying to set up an OAuth handshake; I need help with requesting the request token. I'm using the requests_oauthlib module in Python. Here is the sample code, and it returns Response [400].
consumer_key, consumer_secret, and request_url are all loaded properly. I got my code to work using a different auth module. Can someone explain what HTTP headers are and how they are used in a GET request?
import requests
from requests_oauthlib import OAuth1
from variables import *

oauth = OAuth1(consumer_key, client_secret=consumer_secret)
request_token = requests.get(request_url, auth=oauth,
                             params={'oauth_callback': 'oob', 'format': 'json'})
print request_token
Request: your computer sends an HTTP message to another computer, usually on port 443 or 80.
Response: the server listens for connection requests and responds if it understands the message.
For example, after running telnet stackoverflow.com 80 you can type:
GET /questions/52350391/can-someone-explain-get-requests-specifically-the-http-header HTTP/1.1
Host: stackoverflow.com
User-Agent: curl/7.54.0
Accept: */*
Then press Enter twice to end the request headers, at which point the server responds:
➜ mysite telnet stackoverflow.com 80
Trying 151.101.1.69...
Connected to stackoverflow.com.
Escape character is '^]'.
GET /questions/52350391/can-someone-explain-get-requests-specifically-the-http- header HTTP/2
Host: stackoverflow.com
User-Agent: curl/7.54.0
Accept: */*
HTTP/1.1 301 Moved Permanently
Content-Type: text/html; charset=utf-8
Location: https://stackoverflow.com/questions/52350391/can-someone-explain-get-requests-specifically-the-http-
X-Request-Guid: xxx
Content-Security-Policy: upgrade-insecure-requests
Accept-Ranges: bytes
Age: 0
Content-Length: 217
Accept-Ranges: bytes
Date: Sun, 16 Sep 2018 03:29:16 GMT
Via: 1.1 varnish
Age: 0
Connection: close
X-Served-By: cache-ord1744-ORD
X-Cache: MISS
X-Cache-Hits: 0
X-Timer: S1537068557.736123,VS0,VE25
Vary: Fastly-SSL
X-DNS-Prefetch-Control: off
Set-Cookie: prov=xxx; domain=.stackoverflow.com; expires=Fri, 01-Jan-2055 00:00:00 GMT; path=/; HttpOnly
<html><head><title>Object moved</title></head><body>
<h2>Object moved to here.</h2>
</body></html>
Connection closed by foreign host.
The telnet session then prints the response from the server and closes the connection. The response includes several pieces: the status line, the response headers, and the response body.
Your example might look something like:
GET /some/oauth/api?oauth_callback=oob&format=json HTTP/1.1
Host: someplace.com
Authorization: Bearer asdfasdfasdfasdf
HTTP/1.1 200 OK
Content-Type: application/json
Cache-Control: no-store
Pragma: no-cache
{
  "access_token": "sdfasasdfasdf",
  "token_type": "bearer",
  "expires_in": 3600,
  "refresh_token": "asdfasdfasdfasdf",
  "scope": "create"
}
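If you'd rather inspect these headers from Python instead of telnet, the requests library exposes both sides of the exchange. A small sketch (the URL and parameters are just placeholders):
import requests

r = requests.get('https://stackoverflow.com/', params={'format': 'json'})
print r.request.method, r.request.url   # the request line that went out
print r.request.headers                 # headers requests sent on your behalf
print r.status_code, r.reason           # the response status line
print r.headers['Content-Type']         # one individual response header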
Also check out:
curl -Lv https://stackoverflow.com/questions/52350391/can-someone-explain-get-requests-specifically-the-http-header | head -n 100
Related:
https://www.oauth.com/oauth2-servers/access-tokens/access-token-response/
HTTP Request/Response Header Grammar
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers
https://www.oauth.com/oauth2-servers/authorization/the-authorization-request/

How to make httplib debug information appear at logger debug level

By default, httplib's debug output (the send, headers, and reply lines) appears at INFO level. How can I display the send, headers, and reply lines as DEBUG-level information instead?
import requests
import logging
import httplib
httplib.HTTPConnection.debuglevel = 1
logging.basicConfig() # you need to initialize logging, otherwise you will not see anything from requests
logging.getLogger().setLevel(logging.DEBUG)
requests_log = logging.getLogger("requests.packages.urllib3")
requests_log.setLevel(logging.DEBUG)
requests_log.propagate = True
requests.get('http://httpbin.org/headers')
It prints
INFO:requests.packages.urllib3.connectionpool:Starting new HTTP Connection (1):
httpbin.org
send: 'GET /headers HTTP/1.1\r\nHost: httpbin.org\r\nConnection: keep-alive\r\nA
ccept-Encoding: gzip, deflate\r\nAccept: */*\r\nUser-Agent: python-requests/2.8.
1\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Server: nginx
header: Date: Mon, 14 Dec 2015 12:50:44 GMT
header: Content-Type: application/json
header: Content-Length: 156
header: Connection: keep-alive
header: Access-Control-Allow-Origin: *
header: Access-Control-Allow-Credentials: true
DEBUG:requests.packages.urllib3.connectionpool:"GET /headers HTTP/1.1" 200 156
<Response [200]>
Thanks @Eli. I was able to achieve this using the approach from http://stefaanlippens.net/redirect_python_print:
import logging
import sys
import requests
import httplib

# HTTP stream handler
class WritableObject:
    def __init__(self):
        self.content = []
    def write(self, string):
        self.content.append(string)

# A writable object
http_log = WritableObject()
# Redirection
sys.stdout = http_log
# Enable debug output
httplib.HTTPConnection.debuglevel = 2
# GET operation
requests.get('http://httpbin.org/headers')
# Remember to reset sys.stdout!
sys.stdout = sys.__stdout__
debug_info = ''.join(http_log.content).replace('\\r', '').decode('string_escape').replace('\'', '')
# Remove empty lines
debug_info = "\n".join([ll.rstrip() for ll in debug_info.splitlines() if ll.strip()])
It prints something like:
C:\Users\vkosuri\Dropbox\robot\lab>python New-Redirect_Stdout.py
send: GET /headers HTTP/1.1
Host: httpbin.org
Connection: keep-alive
Accept-Encoding: gzip, deflate
Accept: */*
User-Agent: python-requests/2.8.1
reply: HTTP/1.1 200 OK
header: Server: nginx
header: Date: Tue, 15 Dec 2015 09:36:36 GMT
header: Content-Type: application/json
header: Content-Length: 156
header: Connection: keep-alive
header: Access-Control-Allow-Origin: *
header: Access-Control-Allow-Credentials: true
some_logger.setLevel() does not do what you think it does. It doesn't set the level of the logs being emitted by a logger; it sets the minimum level of message that the logger will pass along to its handlers. To do what you're asking, I can only think of one real, reasonable way:
Capture the logs as they're coming in and re-log them. You can capture them with the idea described here, and use that in a subclass of requests. This would without a doubt be complicated. So, this is probably a good time to start asking yourself, "what am I really trying to achieve and is there another way to go about it?"
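For example, here is a sketch of that idea: swap sys.stdout for a file-like object that forwards httplib's print-based debug lines to logging.debug. LogWriter is an illustrative name, not a library class:
import sys
import logging
import httplib
import requests

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger('httplib')

class LogWriter(object):
    # file-like object that forwards writes to a logger at DEBUG level
    def __init__(self, logger):
        self.logger = logger
    def write(self, text):
        if text.strip():                  # skip bare newlines
            self.logger.debug(text.strip())

httplib.HTTPConnection.debuglevel = 1
old_stdout = sys.stdout
sys.stdout = LogWriter(log)               # capture httplib's print output
try:
    requests.get('http://httpbin.org/headers')
finally:
    sys.stdout = old_stdout               # always restore stdout
This works because logging's default handler writes to sys.stderr, so the re-logged lines don't loop back into the captured stdout.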

Python urllib open issue

I'm trying to fetch data from http://book.libertorrent.com/, but at the moment I'm failing badly because additional data (headers and chunk sizes) is present in the response. My code is very simple:
import urllib

response = urllib.urlopen('http://book.libertorrent.com/login.php')
f = open('someFile.html', 'w')
f.write(response.read())
f.close()
read() returns:
Date: Fri, 09 Nov 2012 07:36:54 GMT
Content-Type: text/html; charset=utf-8
Transfer-Encoding: chunked
Connection: close
Cache-Control: no-cache, pre-check=0, post-check=0
Expires: 0
Pragma: no-cache
Set-Cookie: bb_test=973132321; path=/; domain=book.libertorrent.com
Content-Language: ru
1ec0
...Html...
0
And response.info() is empty.
Is there any way to correct the response?
Let's try this:
$ echo -ne "GET /index.php HTTP/1.1\r\nHost: book.libertorrent.com\r\n\r\n" | nc book.libertorrent.com 80 | head -n 10
HTTP/1.1 200 OK
WWW
Date: Sat, 10 Nov 2012 17:41:57 GMT
Content-Type: text/html; charset=utf-8
Transfer-Encoding: chunked
Connection: keep-alive
Content-Language: ru
1f57
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"><html dir="ltr">
See that "WWW" in the second line? That's no valid HTTP header, I'm guessing that's what's throwing off the response parser here.
By the way, Python 2 and Python 3 behave differently here:
Python 2 seems to immediately interpret anything after this invalid header as content.
Python 3 ignores all headers and continues reading the content after the double newline. Because the headers are ignored, so is the transfer encoding, and therefore the chunk-size lines are interpreted as part of the body.
So in the end the problem is that the server is sending an invalid response, which should be fixed at the server's end.
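Until the server is fixed, one workaround is to fetch the raw bytes yourself, drop any header line that isn't a valid name: value pair, and de-chunk the body by hand. This is only a sketch; fetch_raw and dechunk are illustrative helpers, and it assumes the chunked response shown above:
import socket

def fetch_raw(host, path):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect((host, 80))
    s.sendall("GET %s HTTP/1.1\r\nHost: %s\r\nConnection: close\r\n\r\n" % (path, host))
    data = ''
    while True:
        chunk = s.recv(4096)
        if not chunk:
            break
        data += chunk
    s.close()
    return data

def dechunk(body):
    # chunked encoding: a hex size line, then that many bytes, then CRLF
    out = []
    while body:
        size_line, _, body = body.partition('\r\n')
        size = int(size_line.strip(), 16)
        if size == 0:
            break
        out.append(body[:size])
        body = body[size + 2:]            # skip chunk data plus trailing CRLF
    return ''.join(out)

raw = fetch_raw('book.libertorrent.com', '/login.php')
header_block, _, body = raw.partition('\r\n\r\n')
headers = [h for h in header_block.split('\r\n')
           if ':' in h or h.startswith('HTTP/')]   # drops lines like "WWW"
html = dechunk(body)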

HTTP telnet POST/GAE server question (SIMPLE STUFF)

I am playing with HTTP transfers, just trying to make something work. I have a GAE server and I'm pretty sure it's working properly because it renders when I go to it with my browser, but here is the python code anyway:
import sys

print 'Content-Type: text/html'
print ''
print '<pre>'
number = -1
data = sys.stdin.read()
try:
    number = int(data[data.find('=')+1:])
except:
    number = -1
print 'If your number was', number, ', then you are awesome!!!'
print '</pre>'
I am just learning the whole HTTP POST vs GET vs Response process, but this is what I have been doing from the terminal:
$ telnet localhost 8080
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
GET http://localhost:8080/?number=28 HTTP/1.0
HTTP/1.0 200 Good to go
Server: Development/1.0
Date: Thu, 07 Jul 2011 21:29:28 GMT
Content-Type: text/html
Cache-Control: no-cache
Expires: Fri, 01 Jan 1990 00:00:00 GMT
Content-Length: 61
<pre>
If your number was -1 , then you are awesome!!!
</pre>
Connection closed by foreign host.
I am using a GET here because I stumbled around for about 40 minutes trying to make a telnet POST work - with no success :(
I would appreciate any help on how to get this GET and/or the POST to work. Thanks in advance!!!!
When using GET, no data will be present in the request body, so sys.stdin.read() is bound to come up empty. Instead, look at the environment, specifically os.environ['QUERY_STRING'], as in the sketch below.
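A minimal CGI-style sketch of that, using the standard library's urlparse.parse_qs to pull the number out of the query string (the parameter name comes from your example):
import os
from urlparse import parse_qs

print 'Content-Type: text/html'
print ''
params = parse_qs(os.environ.get('QUERY_STRING', ''))
try:
    number = int(params.get('number', ['-1'])[0])
except ValueError:
    number = -1
print '<pre>If your number was', number, ', then you are awesome!!!</pre>'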
Another thing you're doing a little strangely is the request format: the request line should not include the URL scheme, host, or port. It should look like:
GET /?number=28 HTTP/1.0
Specify the host in a separate Host: header; the server will determine the scheme on its own.
When using POST, most servers won't read past the number of bytes given in the Content-Length header; if you don't supply one, it may be assumed to be zero. The server may treat any bytes after the point specified by the Content-Length as the next request on a persistent connection, and when those bytes don't begin with a valid request, it closes the connection. So basically:
POST / HTTP/1.0
Host: localhost:8080
Content-Length: 2
Content-Type: text/plain
28
But why are you testing this in telnet? How about curl?
$ curl -vs -d'28' -H'Content-Type: text/plain' http://localhost:8004/
* About to connect() to localhost port 8004 (#0)
* Trying ::1... Connection refused
* Trying 127.0.0.1... connected
* Connected to localhost (127.0.0.1) port 8004 (#0)
> POST / HTTP/1.1
> User-Agent: curl/7.20.1 (x86_64-redhat-linux-gnu) libcurl/7.20.1 NSS/3.12.6.2 zlib/1.2.3 libidn/1.16 libssh2/1.2.4
> Host: localhost:8004
> Accept: */*
> Content-Type: text/plain
> Content-Length: 2
>
* HTTP 1.0, assume close after body
< HTTP/1.0 200 OK
< Date: Thu, 07 Jul 2011 22:09:17 GMT
< Server: WSGIServer/0.1 Python/2.6.4
< Content-Type: text/html; charset=UTF-8
< Content-Length: 45
<
* Closing connection #0
{'body': '28', 'method': 'POST', 'query': []}
Or better yet, in Python:
>>> import httplib
>>> headers = {"Content-type": "text/plain",
... "Accept": "text/plain"}
>>>
>>> conn = httplib.HTTPConnection("localhost:8004")
>>> conn.request("POST", "/", "28", headers)
>>> response = conn.getresponse()
>>> print response.read()
{'body': '28', 'method': 'POST', 'query': []}
>>>

Unable to parse the output from HTTPConnection.debuglevel

I am trying to programmatically check the output of a TCP stream. I am able to get the results of the TCP stream by turning on debug in HTTPConnection, but how do I read the data and evaluate it with, say, a regular expression? I keep getting "TypeError: expected string or buffer". Is there a way to convert the result to a string?
thanks!
SCRIPT:
from urllib2 import Request, urlopen, URLError, HTTPError
import urllib2
import cookielib
import httplib
import re
httplib.HTTPConnection.debuglevel = 1
p = re.compile('abc=..........')
cj = cookielib.CookieJar()
proxy_address = '192.168.232.134:8083' # change the IP:PORT, this one is for example
proxy_handler = urllib2.ProxyHandler({'http': proxy_address})
opener = urllib2.build_opener(proxy_handler, urllib2.HTTPCookieProcessor(cj), urllib2.HTTPHandler(debuglevel=1))
urllib2.install_opener(opener)
url = "http://www.google.com/" # change the url
req=urllib2.Request(url)
data=urllib2.urlopen(req)
m=p.match(data)
if m:
    print "Match found."
else:
    print "Match not found."
RESULTS:
send: 'GET hyperlink/ HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.google.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.6\r\n\r\n'
reply: 'HTTP/1.1 303 See Other\r\n'
header: Location: hyperlink:8083/3240951276
header: Set-Cookie: abc=3240951276; path=/; domain=.google.com; expires=Thu, 31-Dec-2020 23:59:59 GMT
header: Content-Length: 0
send: 'GET hyperlink/3240951276 HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: hyperlink\r\nConnection: close\r\nUser-Agent: Python-urllib/2.6\r\n\r\n'
reply: 'HTTP/1.1 303 See Other\r\n'
header: Location: hyperlink
header: Set-Cookie: abc=3240951276; path=/; expires=Thu, 31-Dec-2020 23:59:59 GMT
header: Content-Length: 0
send: 'GET http://www.google.com/ HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.google.com\r\nCookie: abc=3240951276\r\nConnection: close\r\nUser-Agent: Python-urllib/2.6\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Mon, 18 Oct 2010 21:09:32 GMT
header: Expires: -1
header: cache-control: max-age=0, private, private
header: Content-Type: text/html; charset=ISO-8859-1
header: Set-Cookie: PREF=ID=066bc785a2b15ef6:FF=0:TM=1287436172:LM=1287436172:S=mNiXaRhshpf8nLji; expires=Wed, 17-Oct-2012 21:09:32 GMT; path=/; domain=.google.com
header: Set-Cookie: NID=39=ur3gnXL80kEy4shKAh8_-XV8PhmS4G83slPcX9OD3L6uthQZw-wq7RUnB0WKGYR3F_QGoyZAyEPCvjdi69EXXq23dzvpuZSl_KU2o7pqcTB7Vym4co1LOXmi9YQGpbkb; expires=Tue, 19-Apr-2011 21:09:32 GMT; path=/; domain=.google.com; HttpOnly
header: Server: gws
header: X-XSS-Protection: 1; mode=block
header: Connection: close
header: Content-Length: 4676
header: X-Con-Reuse: 1
header: Content-Encoding: gzip
header: via: 1.1 HermesPrefetch (CID2627003316.AID3240951276.TID1)
header: X-Trace-Timing: Start=1287436172845, Sched=0, Dns=2, Con=11, RxS=28, RxD=35
Traceback (most recent call last):
  File "C:\Documents and Settings\asdf\workspace\PythonScripts2\src\Test1.py", line 18, in <module>
    m=p.match(data)
TypeError: expected string or buffer
The debug information httplib provides, which you see in your terminal, is not actually part of the object returned by urllib2.urlopen(). Instead, it's printed directly to your process's sys.stdout. There's no way to change this behaviour in httplib, unfortunately. It's not entirely clear to me what you're trying to achieve by "capturing" this output and running a regular expression over it, but if that's really what you want to do, you would need to replace sys.stdout with something else, such as a suitable StringIO object, and then pick out the output you care about, as in the sketch below.
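A sketch of that capture idea, using a StringIO as the temporary stdout and running the regular expression over the captured text afterwards (the abc= pattern comes from your script):
import sys
import re
import urllib2
from StringIO import StringIO

# debuglevel on the handler, as in your script, so the debug lines get printed
opener = urllib2.build_opener(urllib2.HTTPHandler(debuglevel=1))
buf = StringIO()
sys.stdout = buf                      # httplib's debug prints land here
try:
    opener.open('http://www.google.com/').read()
finally:
    sys.stdout = sys.__stdout__       # always restore stdout

m = re.search(r'abc=\S+', buf.getvalue())
if m:
    print "Match found:", m.group(0)
else:
    print "Match not found."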
However, keep in mind that all the information httplib produces in its debug output is available directly in your program. It's either based on what you pass to httplib (through urllib2) or part of the server's response, and thus available in the object returned by urllib2.urlopen(). For example, it looks like you're trying to extract the cookie information, which you can get simply by reading the cookie from the CookieJar you're already providing. There doesn't seem to be any sensible reason to capture the debug output and parse it.
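For instance, a short sketch of reading that cookie straight from the CookieJar (the cookie name abc comes from your debug output):
import cookielib
import urllib2

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
response = opener.open('http://www.google.com/')

for cookie in cj:                     # a CookieJar iterates over Cookie objects
    if cookie.name == 'abc':
        print 'abc =', cookie.value

# response headers are directly available too:
print response.info().getheader('Content-Type')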
