Usually, downloading a file from a server looks something like this:
import urllib2

fp = open(file, 'wb')
req = urllib2.urlopen(url)
for line in req:
    fp.write(line)
fp.close()
With this approach the download has to run to completion; if the process is stopped or interrupted, the download has to start over from the beginning. I would like my program to be able to pause and resume the download. How do I implement that? Thanks.
The web server must support the Range request header to allow pause/resume downloads:
Range: <unit>=<range-start>-<range-end>
The client can then make a request with the Range header to retrieve only the specified bytes, for example:
Range: bytes=0-1024
In this case the server can respond with 200 OK, indicating that it doesn't support Range requests, or it can respond with 206 Partial Content like this:
HTTP/1.1 206 Partial Content
Accept-Ranges: bytes
Content-Length: 449
Content-Range: bytes 64-512/1024
Response body... bytes 64 through 512 of the 1024-byte file
See:
Range request header
Content-Range response header
Accept-Ranges response header
206 Partial Content
HTTP 1.1 specification
In Python, you can do:
import urllib, os

class myURLOpener(urllib.FancyURLopener):
    """Create a sub-class in order to override error 206. This error means a
    partial file is being sent, which is OK in this case.
    Do nothing with this error.
    """
    def http_error_206(self, url, fp, errcode, errmsg, headers, data=None):
        pass
loop = 1
dlFile = "2.6Distrib.zip"
existSize = 0
myUrlclass = myURLOpener()

if os.path.exists(dlFile):
    outputFile = open(dlFile, "ab")
    existSize = os.path.getsize(dlFile)
    # If the file exists, then only download the remainder
    myUrlclass.addheader("Range", "bytes=%s-" % (existSize))
else:
    outputFile = open(dlFile, "wb")

webPage = myUrlclass.open("http://localhost/%s" % dlFile)

# If the file exists, but we already have the whole thing, don't download again
if int(webPage.headers['Content-Length']) == existSize:
    loop = 0
    print "File already downloaded"

numBytes = 0
while loop:
    data = webPage.read(8192)
    if not data:
        break
    outputFile.write(data)
    numBytes = numBytes + len(data)

webPage.close()
outputFile.close()

for k, v in webPage.headers.items():
    print k, "=", v

print "copied", numBytes, "bytes from", webPage.url
You can find the source: http://code.activestate.com/recipes/83208-resuming-download-of-a-file/
It only works for HTTP downloads.
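For comparison, here is a minimal sketch of the same resume logic using the requests library instead of urllib; the URL and file name are placeholders, and it assumes the server honors Range:
import os
import requests

url = "http://localhost/2.6Distrib.zip"   # placeholder URL
dl_file = "2.6Distrib.zip"

resume_from = os.path.getsize(dl_file) if os.path.exists(dl_file) else 0
headers = {"Range": "bytes=%d-" % resume_from} if resume_from else {}

response = requests.get(url, headers=headers, stream=True)
# 206 means the server honored the Range header, so append the missing tail;
# a plain 200 means it sent the full body, so start the file over
mode = "ab" if response.status_code == 206 else "wb"

with open(dl_file, mode) as fp:
    for chunk in response.iter_content(8192):
        fp.write(chunk)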
I'm attempting to pull information from a log file posted online and read through the output. The only information I really need is at the end of the file. These files are pretty big, and storing the entire socket output in a variable and reading through it consumes a lot of memory. Is there a way to read the socket from bottom to top?
What I currently have:
import urllib

socket = urllib.urlopen(urlString)
OUTPUT = socket.read()
socket.close()

OUTPUT = OUTPUT.split("\n")
for line in OUTPUT:
    if "xxxx" in line:
        print line
I am using Python 2.7. I pretty much want to read about 30 lines from the very end of the socket output.
What you want in this use case is an HTTP Range request. Here is a tutorial I located:
http://stuff-things.net/2015/05/13/web-scale-http-tail/
I should clarify: the advantage of getting the size with a HEAD request, then doing a Range request, is that you do not have to transfer all the content. You mentioned you have pretty big file resources, so this is going to be the best solution :)
edit: added this code below...
Here is a demo (simplified) of that blog article, but translated into Python. Please note this will not work with all HTTP servers! More comments inline:
"""
illustration of how to 'tail' a file using http. this will not work on all
webservers! if you need an http server to test with you can try the
rangehttpserver module:
$ pip install requests
$ pip install rangehttpserver
$ python -m RangeHTTPServer
"""
import requests
TAIL_SIZE = 1024
url = 'http://localhost:8000/lorem-ipsum.txt'
response = requests.head(url)
# not all servers return content-length in a HEAD response, for some reason
assert 'content-length' in response.headers, 'Content length unknown - out of luck!'

# check the resource length and construct a request header for that range
full_length = int(response.headers['content-length'])
assert full_length > TAIL_SIZE

headers = {
    'range': 'bytes={}-{}'.format(full_length - TAIL_SIZE, full_length)
}
# Make a get request, with the range header
response = requests.get(url, headers=headers)
assert 'accept-ranges' in response.headers, 'Accept-ranges response header missing'
assert response.headers['accept-ranges'] == 'bytes'
assert len(response.text) == TAIL_SIZE
# Otherwise you get the entire file
response = requests.get(url)
assert len(response.text) == full_length
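Coming back to the original question (roughly the last 30 lines), you can then split the body of the Range response; this sketch assumes TAIL_SIZE is large enough to cover those lines:
# filter roughly the last 30 lines out of the tail fetched with the Range header
tail = requests.get(url, headers=headers)
for line in tail.text.splitlines()[-30:]:
    if "xxxx" in line:   # the marker from the original question
        print line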
I am crawling the web using urllib3. Example code:
from urllib3 import PoolManager
pool = PoolManager()
response = pool.request("GET", url)
The problem is that I may stumble upon a URL that is a download of a really large file, and I am not interested in downloading it.
I found this question - Link - and it suggests using urllib and urlopen. I don't want to contact the server twice.
I want to limit the file size to 25MB.
Is there a way I can do this with urllib3?
If the server supplies a Content-Length header, then you can use that to determine if you'd like to continue downloading the remainder of the body or not. If the server does not provide the header, then you'll need to stream the response until you decide you no longer want to continue.
To do this, you'll need to make sure that you're not preloading the full response.
from urllib3 import PoolManager
pool = PoolManager()
response = pool.request("GET", url, preload_content=False)
# Maximum amount we want to read
max_bytes = 1000000

content_bytes = response.headers.get("Content-Length")
if content_bytes and int(content_bytes) < max_bytes:
    # Expected body is smaller than our maximum, read the whole thing
    data = response.read()
    # Do something with data
    ...
elif content_bytes is None:
    # Alternatively, stream until we hit our limit
    amount_read = 0
    for chunk in response.stream():
        amount_read += len(chunk)
        # Save chunk
        ...
        if amount_read > max_bytes:
            break

# Release the connection back into the pool
response.release_conn()
I'm currently using Python requests for HTTP requests, but due to limitations in the API, I'm unable to keep using the library.
I need a library which will allow me to write the request body in a streaming file-like fashion, as the data which I'll be sending won't all be immediately available, plus I'd like to save as much memory as possible when making a request. Is there an easy-to-use library which will allow me to send a PUT request like this:
request = HTTPRequest()
request.headers['content-type'] = 'application/octet-stream'
# etc
request.connect()
# send body
with open('myfile', 'rb') as f:
    while True:
        chunk = f.read(64 * 1024)
        request.body.write(chunk)
        if not len(chunk) == 64 * 1024:
            break
# finish
request.close()
More specifically, I have one thread to work with. Using this thread, I receive callbacks as I receive a stream over the network. Essentially, those callbacks look like this:
class MyListener(Listener):
    def on_stream_start(stream_name):
        pass

    def on_stream_chunk(chunk):
        pass

    def on_stream_end(total_size):
        pass
I need to essentially create my upload request in the on_stream_start method, upload chunks in the on_stream_chunk method, then finish the upload in the on_stream_end method. Thus, I need a library which supports a method like write(chunk) to be able to do something similar to the following:
class MyListener(Listener):
    request = None

    def on_stream_start(stream_name):
        request = RequestObject(get_url(), "PUT")
        request.headers.content_type = "application/octet-stream"
        # ...

    def on_stream_chunk(chunk):
        request.write_body(chunk + sha256(chunk).hexdigest())

    def on_stream_end(total_size):
        request.close()
The requests library supports file-like objects and generators for reading but nothing for writing out the requests: pull instead of push. Is there a library which will allow me to push data up the line to the server?
As far as I can tell httplib's HTTPConnection.request does exactly what you want.
I tracked down the function which actually does the sending, and as long as you're passing a file-like object (and not a string), it chunks it up:
Definition: httplib.HTTPConnection.send(self, data)
Source:
def send(self, data):
    """Send `data' to the server."""
    if self.sock is None:
        if self.auto_open:
            self.connect()
        else:
            raise NotConnected()

    if self.debuglevel > 0:
        print "send:", repr(data)
    blocksize = 8192
    if hasattr(data, 'read') and not isinstance(data, array):
        if self.debuglevel > 0: print "sendIng a read()able"
        ## {{{ HERE IS THE CHUNKING LOGIC
        datablock = data.read(blocksize)
        while datablock:
            self.sock.sendall(datablock)
            datablock = data.read(blocksize)
        ## }}}
    else:
        self.sock.sendall(data)
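So, as a usage sketch (the host, path, and file name here are placeholders, not from the question), you can hand request() an open file and httplib will stream it in 8192-byte blocks:
import os
import httplib

body = open('myfile', 'rb')                       # any object with a read() method works
headers = {
    'Content-Type': 'application/octet-stream',
    # httplib can often derive this via fileno()/fstat, but being explicit is safer
    'Content-Length': str(os.fstat(body.fileno()).st_size),
}

conn = httplib.HTTPConnection('example.com')      # placeholder host
conn.request('PUT', '/upload', body, headers)     # the file is sent in 8192-byte blocks
response = conn.getresponse()
print response.status, response.reason

body.close()
conn.close()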
I do something like this in a few places in my codebase. You need an upload file wrapper, and you need another thread or a greenthread - I'm using eventlet for fake threading in my case. Call requests.put, which will block on read() of your file-like object wrapper. The thread you call put in will block waiting, so you need to do the receiving in another thread.
Sorry for not posting code, I just saw this while I was zipping through. I hope this is enough to help; if not, maybe I can edit and add more later.
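To make that concrete, here is a rough sketch of the idea: a queue fed by the stream callbacks, a generator that blocks on the queue, and a thread that runs requests.put (the URL is a placeholder; requests sends a generator body using chunked transfer encoding):
import threading
import Queue   # 'queue' on Python 3

import requests

chunk_queue = Queue.Queue()

def body_generator():
    """Yield chunks as they arrive; a None sentinel ends the upload."""
    while True:
        chunk = chunk_queue.get()   # blocks until a callback pushes data
        if chunk is None:
            return
        yield chunk

# requests.put blocks while the generator waits for data, so run it in its own thread
uploader = threading.Thread(target=requests.put,
                            args=('http://example.com/upload',),   # placeholder URL
                            kwargs={'data': body_generator()})
uploader.start()

# on_stream_chunk(chunk) would do something like:
chunk_queue.put(b'some chunk of data')
# and on_stream_end(total_size):
chunk_queue.put(None)
uploader.join()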
Requests actually supports multipart encoded requests with the files parameter:
Multipart POST example in the official documentation:
url = 'http://httpbin.org/post'
files = {'file': open('report.xls', 'rb')}
r = requests.post(url, files=files)
r.text
{
    ...
    "files": {
        "file": "<censored...binary...data>"
    },
    ...
}
You can create your own file-like streaming object if you like, too, but you cannot mix a stream and files in the same request.
A simple case that might work for you would be to open the file and return a chunking, generator-based reader:
def read_as_gen(filename, chunksize=-1):  # -1 reads the file to the end, like a regular .read()
    with open(filename, mode='rb') as f:
        while True:
            chunk = f.read(chunksize)
            if len(chunk) > 0:
                yield chunk
            else:
                return  # ending the generator raises StopIteration for us
# Now that we can read the file as a generator with a chunksize, give it to the files parameter
files = {'file': read_as_gen(filename, 64*1024)}
# ... post as normal.
But if you had to block the chunking on something else, like another network buffer, you could handle that in the same manner:
def read_buffer_as_gen(buffer_params, chunksize=-1):  # -1 reads the buffer to the end, like a regular .read()
    with buffer_open(*buffer_params) as buf:  # some function to open up your buffer
        # you could also just pass in the buffer itself and skip the `with` block
        while True:
            chunk = buf.read(chunksize)
            if len(chunk) > 0:
                yield chunk
            else:
                return
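If multipart encoding isn't actually required, note that requests can also take a generator like read_as_gen directly as the data argument, in which case the body is sent with chunked transfer encoding (the URL below is a placeholder):
import requests

r = requests.post('http://example.com/upload',                 # placeholder URL
                  data=read_as_gen('report.xls', 64 * 1024),   # generator body -> chunked upload
                  headers={'Content-Type': 'application/octet-stream'})
print r.status_code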
This may help:
import urllib2
request = urllib2.Request(uri, data=data)
request.get_method = lambda: 'PUT' # or 'DELETE'
response = urllib2.urlopen(request)
I'm playing around trying to write a client for a site which provides data as an HTTP stream (aka HTTP server push). However, urllib2.urlopen() grabs the stream in its current state and then closes the connection. I tried skipping urllib2 and using httplib directly, but this seems to have the same behaviour.
The request is a POST request with a set of five parameters. There are no cookies or authentication required, however.
Is there a way to get the stream to stay open, so it can be checked each program loop for new contents, rather than waiting for the whole thing to be redownloaded every few seconds, introducing lag?
You could try the requests lib.
import requests
r = requests.get('http://httpbin.org/stream/20', stream=True)
for line in r.iter_lines():
    # filter out keep-alive new lines
    if line:
        print line
You could also add parameters:
import requests
settings = { 'interval': '1000', 'count':'50' }
url = 'http://agent.mtconnect.org/sample'
r = requests.get(url, params=settings, stream=True)
for line in r.iter_lines():
    if line:
        print line
Do you need to actually parse the response headers, or are you mainly interested in the content? And is your HTTP request complex, making you set cookies and other headers, or will a very simple request suffice?
If you only care about the body of the HTTP response and don't have a very fancy request, you should consider simply using a socket connection:
import socket
SERVER_ADDR = ("example.com", 80)
sock = socket.create_connection(SERVER_ADDR)
f = sock.makefile("r+", bufsize=0)
f.write("GET / HTTP/1.0\r\n"
+ "Host: example.com\r\n" # you can put other headers here too
+ "\r\n")
# skip headers
while f.readline() != "\r\n":
pass
# keep reading forever
while True:
line = f.readline() # blocks until more data is available
if not line:
break # we ran out of data!
print line
sock.close()
One way to do it using urllib2 is (assuming this site also requires Basic Auth):
import urllib2
p_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
url = 'http://streamingsite.com'
p_mgr.add_password(None, url, 'login', 'password')
auth = urllib2.HTTPBasicAuthHandler(p_mgr)
opener = urllib2.build_opener(auth)
urllib2.install_opener(opener)
f = opener.open('http://streamingsite.com')
while True:
    data = f.readline()
I currently use WebFaction for my hosting with the basic package that gives us 80MB of RAM. This is more than adequate for our needs at the moment, apart from our backups. We do our own backups to S3 once a day.
The backup process is this: dump the database, tar.gz all the files into one backup named with the correct date of the backup, upload to S3 using the python library provided by Amazon.
Unfortunately, it appears (although I don't know this for certain) that either my code for reading the file or the S3 code is loading the entire file into memory. As the file is approximately 320MB (for today's backup), it is using about 320MB just for the backup. This causes WebFaction to quit all our processes, meaning the backup doesn't happen and our site goes down.
So this is the question: Is there any way to not load the whole file into memory, or are there any other Python S3 libraries that are much better with RAM usage? Ideally it needs to be about 60MB at the most! If this can't be done, how can I split the file and upload separate parts?
Thanks for your help.
This is the section of code (in my backup script) that caused the processes to be quit:
filedata = open(filename, 'rb').read()
content_type = mimetypes.guess_type(filename)[0]
if not content_type:
    content_type = 'text/plain'

print 'Uploading to S3...'
response = connection.put(BUCKET_NAME, 'daily/%s' % filename, S3.S3Object(filedata),
                          {'x-amz-acl': 'public-read', 'Content-Type': content_type})
It's a little late but I had to solve the same problem so here's my answer.
Short answer: in Python 2.6+, yes! This is because httplib supports file-like objects as of v2.6. So all you need is...
fileobj = open(filename, 'rb')
content_type = mimetypes.guess_type(filename)[0]
if not content_type:
    content_type = 'text/plain'

print 'Uploading to S3...'
response = connection.put(BUCKET_NAME, 'daily/%s' % filename, S3.S3Object(fileobj),
                          {'x-amz-acl': 'public-read', 'Content-Type': content_type})
Long answer...
The S3.py library uses Python's httplib to do its connection.put() HTTP requests. You can see in the source that it just passes the data argument to the httplib connection.
From S3.py...
def _make_request(self, method, bucket='', key='', query_args={}, headers={}, data='', metadata={}):
    ...
    if (is_secure):
        connection = httplib.HTTPSConnection(host)
    else:
        connection = httplib.HTTPConnection(host)

    final_headers = merge_meta(headers, metadata);
    # add auth header
    self._add_aws_auth_header(final_headers, method, bucket, key, query_args)

    connection.request(method, path, data, final_headers)  # <-- IMPORTANT PART
    resp = connection.getresponse()
    if resp.status < 300 or resp.status >= 400:
        return resp
    # handle redirect
    location = resp.getheader('location')
    if not location:
        return resp
    ...
If we take a look at the Python httplib documentation, we can see that...
HTTPConnection.request(method, url[, body[, headers]])
This will send a request to the server using the HTTP request method method and the selector url. If the body argument is present, it should be a string of data to send after the headers are finished. Alternatively, it may be an open file object, in which case the contents of the file is sent; this file object should support fileno() and read() methods. The header Content-Length is automatically set to the correct value. The headers argument should be a mapping of extra HTTP headers to send with the request.
Changed in version 2.6: body can be a file object.
Don't read the whole file into your filedata variable. You could use a loop and then just read ~60 MB at a time and submit each part to Amazon.
backup = open(filename, 'rb')
part_num = 0
while True:
    part_of_file = backup.read(60000000)  # not exactly 60 MB....
    if not part_of_file:
        break
    part_num += 1
    response = connection.put(BUCKET_NAME, 'daily/%s.part%d' % (filename, part_num), S3.S3Object(part_of_file))  # submit this part to Amazon; the key naming is just an example