Reading HTTP server push streams with Python

I'm playing around trying to write a client for a site which provides data as an HTTP stream (aka HTTP server push). However, urllib2.urlopen() grabs the stream in its current state and then closes the connection. I tried skipping urllib2 and using httplib directly, but this seems to have the same behaviour.
The request is a POST request with a set of five parameters. There are no cookies or authentication required, however.
Is there a way to get the stream to stay open, so it can be checked on each pass of the program's main loop for new content, rather than re-downloading the whole thing every few seconds and introducing lag?

You could try the requests library.
import requests

r = requests.get('http://httpbin.org/stream/20', stream=True)
for line in r.iter_lines():
    # filter out keep-alive new lines
    if line:
        print line
You can also add parameters:
import requests

settings = {'interval': '1000', 'count': '50'}
url = 'http://agent.mtconnect.org/sample'

r = requests.get(url, params=settings, stream=True)
for line in r.iter_lines():
    if line:
        print line
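Since the question's request is a POST carrying five parameters, the same streaming approach works with requests.post(); a minimal sketch, with the URL and field names as placeholders rather than the real service:
import requests

# hypothetical endpoint and POST fields; substitute your own
url = 'http://example.com/stream'
payload = {'param1': 'a', 'param2': 'b', 'param3': 'c', 'param4': 'd', 'param5': 'e'}

r = requests.post(url, data=payload, stream=True)
for line in r.iter_lines():
    if line:        # skip keep-alive newlines
        print line  # handle each new line as it arrives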

Do you need to actually parse the response headers, or are you mainly interested in the content? And is your HTTP request complex, making you set cookies and other headers, or will a very simple request suffice?
If you only care about the body of the HTTP response and don't have a very fancy request, you should consider simply using a socket connection:
import socket

SERVER_ADDR = ("example.com", 80)

sock = socket.create_connection(SERVER_ADDR)
f = sock.makefile("r+", bufsize=0)

f.write("GET / HTTP/1.0\r\n"
        + "Host: example.com\r\n"  # you can put other headers here too
        + "\r\n")

# skip headers
while f.readline() != "\r\n":
    pass

# keep reading forever
while True:
    line = f.readline()  # blocks until more data is available
    if not line:
        break  # we ran out of data!
    print line

sock.close()
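If you want to check for new data on each pass of your own program loop instead of blocking on readline(), one option is to read the stream in a background thread and hand lines to the main loop through a Queue. A minimal sketch of that pattern (my addition, not part of the original answer), assuming f is the stream's file object from a connection like the one above:
import threading
import Queue  # the Python 2 name; it is queue in Python 3

lines = Queue.Queue()

def reader(stream, q):
    # runs in the background and blocks on the stream so the main loop doesn't have to
    for line in iter(stream.readline, ""):
        q.put(line)

t = threading.Thread(target=reader, args=(f, lines))
t.daemon = True
t.start()

while True:
    try:
        line = lines.get_nowait()  # returns immediately if nothing new has arrived
        print line
    except Queue.Empty:
        pass  # no new data yet; do the rest of the loop's work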

One way to do it using urllib2 (assuming this site also requires Basic Auth) is:
import urllib2

p_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
url = 'http://streamingsite.com'
p_mgr.add_password(None, url, 'login', 'password')

auth = urllib2.HTTPBasicAuthHandler(p_mgr)
opener = urllib2.build_opener(auth)
urllib2.install_opener(opener)

f = opener.open('http://streamingsite.com')
while True:
    data = f.readline()  # blocks until a new line arrives
    if data:
        print data

Related

Derive protocol from url

I have a list of urls such as ["www.bol.com", "www.dopper.com"].
In order to use them as start URLs in Scrapy I need to know the correct HTTP protocol.
For example:
["https://www.bol.com/nl/nl/", "https://dopper.com/nl"]
As you can see, the protocol may be https or http, and the domain may come with or without www.
Not sure if there are any other variations.
Is there any Python tool that can determine the right protocol?
If not, and I have to build the logic myself, what are the cases that I should take into account?
For the second option, this is what I have so far:
import requests

def identify_protocol(url):
    try:
        r = requests.get("https://" + url + "/", timeout=10)
        return r.url, r.status_code
    except requests.RequestException:
        # RequestException also covers connection errors, which HTTPError does not
        try:
            r = requests.get("http://" + url + "/", timeout=10)
            return r.url, r.status_code
        except requests.RequestException:
            try:
                r = requests.get("https://" + url.replace("www.", "") + "/", timeout=10)
                return r.url, r.status_code
            except requests.RequestException:
                return None, None
Is there any other possibility I should take into account?
There is no way to determine the protocol/full domain from the fragment directly; the information simply isn't there. In order to find it you would need either:
1. a database of the correct protocols/domains, in which you can look up your domain fragment, or
2. to make the request and see what the server tells you.
If you do (2) you can of course gradually build your own database to avoid needing the request in future.
On many https servers, if you attempt an http connection you will be redirected to https. If you are not redirected, then you can reliably use http. If the http request fails, you could try again with https and see if it works.
The same applies to the domain: if the site usually redirects, you can perform the request using the original domain and see where you are redirected.
An example using requests:
>>> import requests
>>> r = requests.get('http://bol.com')
>>> r
<Response [200]>
>>> r.url
'https://www.bol.com/nl/nl/'
As you can see, the response object's url attribute holds the final destination URL, including the protocol.
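Combining that redirect-following behaviour with the http-then-https fallback described above gives something like the following sketch (the resolve_url helper is my own name, not from the original answer):
import requests

def resolve_url(fragment, timeout=10):
    """Return the final URL (scheme + domain + path) the server settles on, or None."""
    for scheme in ("http://", "https://"):
        try:
            r = requests.get(scheme + fragment, timeout=timeout)
            return r.url  # the final URL after any redirects
        except requests.RequestException:
            continue      # that scheme failed outright; try the next one
    return None

print(resolve_url("bol.com"))  # e.g. 'https://www.bol.com/nl/nl/'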
As I understood the question, you need to retrieve the final url after all possible redirections. This can be done with the built-in urllib.request. If the provided url has no scheme, you can use http as the default. To parse the input url I used a combination of urlsplit() and urlunsplit().
Code:
import urllib.request as request
import urllib.parse as parse

def find_redirect_location(url, proxy=None):
    parsed_url = parse.urlsplit(url.strip())
    url = parse.urlunsplit((
        parsed_url.scheme or "http",
        parsed_url.netloc or parsed_url.path,
        parsed_url.path.rstrip("/") + "/" if parsed_url.netloc else "/",
        parsed_url.query,
        parsed_url.fragment
    ))

    if proxy:
        handler = request.ProxyHandler(dict.fromkeys(("http", "https"), proxy))
        opener = request.build_opener(handler, request.ProxyBasicAuthHandler())
    else:
        opener = request.build_opener()

    with opener.open(url) as response:
        return response.url
Then you can just call this function on every url in the list:
urls = ["bol.com ","www.dopper.com", "https://google.com"]
final_urls = list(map(find_redirect_location, urls))
You can also use proxies:
from itertools import cycle
urls = ["bol.com ","www.dopper.com", "https://google.com"]
proxies = ["http://localhost:8888"]
final_urls = list(map(find_redirect_location, urls, cycle(proxies)))
To make it a bit faster you can make checks in parallel threads using ThreadPoolExecutor:
from concurrent.futures import ThreadPoolExecutor
urls = ["bol.com ","www.dopper.com", "https://google.com"]
final_urls = list(ThreadPoolExecutor().map(find_redirect_location, urls))

Reading URL socket backwards in Python

I'm attempting to pull information from a log file posted online and read through the output. The only information I really need is at the end of the file. These files are pretty big, and storing the entire socket output in a variable and reading through it consumes a lot of memory. Is there a way to read the socket from bottom to top?
What I currently have:
socket = urllib.urlopen(urlString)
OUTPUT = socket.read()
socket.close()

OUTPUT = OUTPUT.split("\n")
for line in OUTPUT:
    if "xxxx" in line:
        print line
I am using Python 2.7. I pretty much want to read about 30 lines from the very end of the socket's output.
What you want in this use case is an HTTP Range request. Here is a tutorial I located:
http://stuff-things.net/2015/05/13/web-scale-http-tail/
I should clarify: the advantage of getting the size with a HEAD request, then doing a Range request, is that you do not have to transfer all the content. You mentioned you have pretty big file resources, so this is going to be the best solution :)
edit: added this code below...
Here is a demo (simplified) of that blog article, but translated into Python. Please note this will not work with all HTTP servers! More comments inline:
"""
illustration of how to 'tail' a file using http. this will not work on all
webservers! if you need an http server to test with you can try the
rangehttpserver module:
$ pip install requests
$ pip install rangehttpserver
$ python -m RangeHTTPServer
"""
import requests
TAIL_SIZE = 1024
url = 'http://localhost:8000/lorem-ipsum.txt'
response = requests.head(url)
# not all servers return content-length in head, for some reason
assert 'content-length' in response.headers, 'Content length unknown- out of luck!'
# check the the resource length and construct a request header for that range
full_length = int(response.headers['content-length'])
assert full_length > TAIL_SIZE
headers = {
'range': 'bytes={}-{}'.format( full_length - TAIL_SIZE, full_length)
}
# Make a get request, with the range header
response = requests.get(url, headers=headers)
assert 'accept-ranges' in response.headers, 'Accept-ranges response header missing'
assert response.headers['accept-ranges'] == 'bytes'
assert len(response.text) == TAIL_SIZE
# Otherwise you get the entire file
response = requests.get(url)
assert len(response.text) == full_length
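To get roughly the last 30 lines the question asks for, re-use the ranged request and split the body; if 1024 bytes cover fewer than 30 lines, just increase TAIL_SIZE (that figure is my guess, not something from the original answer):
# re-issue the ranged request and keep only the last 30 lines
tail = requests.get(url, headers=headers).text
for line in tail.split("\n")[-30:]:
    if "xxxx" in line:
        print line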

How to download a large file with httplib2

Is it possible to download a large file in chunks using httplib2? I am downloading files from a Google API, and in order to use the credentials from the Google OAuth2WebServerFlow, I am bound to use httplib2.
At the moment I am doing:
flow = OAuth2WebServerFlow(
    client_id=XXXX,
    client_secret=XXXX,
    scope=XYZ,
    redirect_uri=XYZ
)
credentials = flow.step2_exchange(oauth_code)

http = httplib2.Http()
http = credentials.authorize(http)

resp, content = self.http.request(url, "GET")
with open(file_name, 'wb') as fw:
    fw.write(content)
But the content variable can get more than 500MB.
Any way of reading the response in chunks?
You could consider streaming_httplib2, a fork of httplib2 with exactly that change in behaviour.
in order to use the credentials from the google OAuth2WebServerFlow, I am bound to use httplib2.
If you need features that aren't available in httplib2, it's worth looking at how much work it would be to get your credential handling working with another HTTP library. It may be a good longer-term investment. (e.g. How to download large file in python with requests.py?.)
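For instance, if you pull the access token out of the oauth2client credentials object, a chunked download with requests could look like the sketch below (my assumption about how you would wire it up, not a documented recipe):
import requests

# assumes `credentials` is the oauth2client object from flow.step2_exchange(oauth_code)
headers = {'Authorization': 'Bearer ' + credentials.access_token}

r = requests.get(url, headers=headers, stream=True)
with open(file_name, 'wb') as fw:
    for chunk in r.iter_content(chunk_size=1024 * 1024):  # 1 MB at a time
        if chunk:
            fw.write(chunk)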
About reading the response in chunks (works with httplib, should work with httplib2):
import httplib

conn = httplib.HTTPConnection("google.com")
conn.request("GET", "/")
r1 = conn.getresponse()

try:
    print r1.fp.next()
    print r1.fp.next()
except:
    print "Exception handled!"
Note: next() may raise a StopIteration exception, which you need to handle.
You can avoid calling next() by iterating over the file object instead:
F = open("file.html", "w")
for n in r1.fp:
    F.write(n)
    F.flush()
You can apply oauth2client.client.Credentials to a urllib2 request.
First, obtain the credentials object. In your case, you're using:
credentials = flow.step2_exchange(oauth_code)
Now, use that object to get the auth headers and add them to the urllib2 request:
req = urllib2.Request(url)
auth_headers = {}
credentials.apply(auth_headers)
for k, v in auth_headers.iteritems():
    req.add_header(k, v)
resp = urllib2.urlopen(req)
Now resp is a file-like object that you can use to read the contents of the URL.
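Because resp behaves like a file, you can read it in chunks rather than all at once; a minimal sketch:
CHUNK_SIZE = 1024 * 1024  # 1 MB per read; adjust to taste

with open(file_name, 'wb') as fw:
    while True:
        chunk = resp.read(CHUNK_SIZE)
        if not chunk:
            break  # end of the response body
        fw.write(chunk)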

Making a POST call instead of GET using urllib2

There's a lot of stuff out there on urllib2 and POST calls, but I'm stuck on a problem.
I'm trying to do a simple POST call to a service:
url = 'http://myserver/post_service'
data = urllib.urlencode({'name' : 'joe',
                         'age' : '10'})
content = urllib2.urlopen(url=url, data=data).read()
print content
I can see in the server logs that I'm doing GET calls, even though I'm sending the data argument to urlopen.
The library is raising a 404 error (not found), which is correct for a GET call; POST calls are processed fine (I also tried with a POST from an HTML form).
Do it in stages, and modify the object as you go, like this:
# make a string with the request type in it:
method = "POST"

# create a handler. you can specify different handlers here (file uploads etc)
# but we go for the default
handler = urllib2.HTTPHandler()

# create an openerdirector instance
opener = urllib2.build_opener(handler)

# build a request
data = urllib.urlencode(dictionary_of_POST_fields_or_None)
request = urllib2.Request(url, data=data)

# add any other information you want
request.add_header("Content-Type", 'application/json')

# overload the get method function with a small anonymous function...
request.get_method = lambda: method

# try it; don't forget to catch the result
try:
    connection = opener.open(request)
except urllib2.HTTPError, e:
    connection = e

# check. Substitute with the appropriate HTTP code.
if connection.code == 200:
    data = connection.read()
else:
    # handle the error case. connection.read() will still contain data
    # if any was returned, but it probably won't be of any use
    pass
This way allows you to extend to making PUT, DELETE, HEAD and OPTIONS requests too, simply by substituting the value of method or even wrapping it up in a function (see the sketch below). Depending on what you're trying to do, you may also need a different HTTP handler, e.g. for multi-file upload.
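A small sketch of the wrap-it-in-a-function idea (the http_request helper name is mine, not from the original answer):
import urllib
import urllib2

def http_request(url, method, fields=None, headers=None):
    """Issue an HTTP request with an arbitrary verb via urllib2."""
    data = urllib.urlencode(fields) if fields is not None else None
    request = urllib2.Request(url, data=data)
    for key, value in (headers or {}).iteritems():
        request.add_header(key, value)
    request.get_method = lambda: method  # force the verb, as above
    opener = urllib2.build_opener(urllib2.HTTPHandler())
    try:
        return opener.open(request)
    except urllib2.HTTPError, e:
        return e  # error responses can still be read

# usage: response = http_request('http://myserver/post_service', 'PUT', {'name': 'joe'})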
This may have been answered before: Python URLLib / URLLib2 POST.
Your server is likely performing a 302 redirect from http://myserver/post_service to http://myserver/post_service/. When the 302 redirect is performed, the request changes from POST to GET (see Issue 1401). Try changing url to http://myserver/post_service/.
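A quick way to test that theory, using the question's own code with the trailing slash as the only change:
import urllib
import urllib2

url = 'http://myserver/post_service/'  # note the trailing slash
data = urllib.urlencode({'name': 'joe', 'age': '10'})
print urllib2.urlopen(url, data).read()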
Have a read of the urllib Missing Manual. Pulled from there is the following simple example of a POST request.
url = 'http://myserver/post_service'
data = urllib.urlencode({'name' : 'joe', 'age' : '10'})
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
print response.read()
As suggested by @Michael Kent, do consider requests; it's great.
EDIT: That said, I do not know why passing data to urlopen() does not result in a POST request; it should. I suspect your server is redirecting, or misbehaving.
The requests module may ease your pain.
import requests

url = 'http://myserver/post_service'
data = dict(name='joe', age='10')

r = requests.post(url, data=data, allow_redirects=True)
print r.content
It should be sending a POST if you provide a data parameter (like you are doing).
From the docs:
"the HTTP request will be a POST instead of a GET when the data parameter is provided"
So, add some debug output to see what's happening on the client side.
You can modify your code to this and try again:
import urllib
import urllib2

url = 'http://myserver/post_service'
opener = urllib2.build_opener(urllib2.HTTPHandler(debuglevel=1))
data = urllib.urlencode({'name' : 'joe',
                         'age' : '10'})
content = opener.open(url, data=data).read()
Try this instead:
url = 'http://myserver/post_service'
data = urllib.urlencode({'name' : 'joe',
                         'age' : '10'})
req = urllib2.Request(url=url, data=data)
content = urllib2.urlopen(req).read()
print content
url="https://myserver/post_service"
data["name"] = "joe"
data["age"] = "20"
data_encoded = urllib2.urlencode(data)
print urllib2.urlopen(url + "?" + data_encoded).read()
May be this can help

Adding Cookie to SOAPpy Request

I'm trying to send a SOAP request using SOAPpy as the client. I've found some documentation stating how to add a cookie by extending SOAPpy.HTTPTransport, but I can't seem to get it to work.
I tried to use the example here,
but the server I'm trying to send the request to started throwing 415 errors, so I'm trying to accomplish this without using ClientCookie, or by figuring out why the server is throwing 415's when I do use it. I suspect it might be because ClientCookie uses urllib2 & http/1.1, whereas SOAPpy uses urllib & http/1.0
Does someone know how to make ClientCookie use http/1.0, if that is even the problem, or a way to add a cookie to the SOAPpy headers without using ClientCookie? When I tried this code with other services, it only seemed to throw errors when sending requests to Microsoft servers.
I'm still finding my footing with python, so it could just be me doing something dumb.
import sys, os, string
from SOAPpy import WSDL, HTTPTransport, Config, SOAPAddress, Types
import ClientCookie

Config.cookieJar = ClientCookie.MozillaCookieJar()

class CookieTransport(HTTPTransport):
    def call(self, addr, data, namespace, soapaction=None, encoding=None,
             http_proxy=None, config=Config):
        if not isinstance(addr, SOAPAddress):
            addr = SOAPAddress(addr, config)

        cookie_cutter = ClientCookie.HTTPCookieProcessor(config.cookieJar)
        hh = ClientCookie.HTTPHandler()
        hh.set_http_debuglevel(1)

        # TODO proxy support
        opener = ClientCookie.build_opener(cookie_cutter, hh)

        t = 'text/xml'
        if encoding != None:
            t += '; charset="%s"' % encoding
        opener.addheaders = [("Content-Type", t),
                             ("Cookie", "Username=foobar"),  # ClientCookie should handle this
                             ("SOAPAction", "%s" % (soapaction))]

        response = opener.open(addr.proto + "://" + addr.host + addr.path, data)
        data = response.read()

        # get the new namespace
        if namespace is None:
            new_ns = None
        else:
            new_ns = self.getNS(namespace, data)

        print '\n' * 4, '-' * 50
        # return response payload
        return data, new_ns

url = 'http://www.authorstream.com/Services/Test.asmx?WSDL'
proxy = WSDL.Proxy(url, transport=CookieTransport)
print proxy.GetList()
Error 415 is caused by an incorrect Content-Type header.
Install HttpFox for Firefox, or another tool (Wireshark, Charles, or Fiddler), to track what headers you are sending. Try Content-Type: application/xml.
...
t = 'application/xml'
if encoding != None:
    t += '; charset="%s"' % encoding
...
If you are trying to send a file to the web server, use Content-Type: application/x-www-form-urlencoded.
A nice hack for this is described in "Using Cookies with SOAPpy calls".
