Is there a way to download a huge and still-growing file over HTTP using the partial-download feature?
It seems that this code downloads the file from scratch every time it is executed:
import urllib
urllib.urlretrieve("http://www.example.com/huge-growing-file", "huge-growing-file")
I'd like:
To fetch just the newly-written data
To download from scratch only if the source file becomes smaller (for example, if it has been rotated).
It is possible to do a partial download using the Range header; the following will request a selected range of bytes:
import urllib2

req = urllib2.Request('http://www.python.org/')
req.headers['Range'] = 'bytes=%s-%s' % (start, end)
f = urllib2.urlopen(req)
For example:
>>> req = urllib2.Request('http://www.python.org/')
>>> req.headers['Range'] = 'bytes=%s-%s' % (100, 150)
>>> f = urllib2.urlopen(req)
>>> f.read()
'l1-transitional.dtd">\n\n\n<html xmlns="http://www.w3.'
Using this header you can resume partial downloads. In your case, all you have to do is keep track of the already-downloaded size and request a new range.
Keep in mind that the server needs to accept this header for this to work.
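As a concrete sketch of that bookkeeping (Python 3 syntax, and the helper names are mine, not from any library): decide whether to resume, restart, or do nothing by comparing the local size against the remote size (which you could get from a HEAD request's Content-Length), then build a Range request from the local offset:

```python
import urllib.request

def plan_fetch(local_size, remote_size):
    # Restart if the remote file shrank (e.g. it was rotated),
    # do nothing if we already have everything, otherwise resume.
    if remote_size < local_size:
        return ('restart', 0)
    if remote_size == local_size:
        return ('noop', local_size)
    return ('resume', local_size)

def build_request(url, start):
    # Ask the server for everything from byte offset `start` onwards.
    req = urllib.request.Request(url)
    if start > 0:
        req.add_header('Range', 'bytes=%d-' % start)
    return req
```

Appending the fetched bytes to the local file and repeating this on a timer gives you the tail-follow behavior you describe, as long as the server honors Range.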
This is quite easy to do using TCP sockets and raw HTTP. The relevant request header is "Range".
An example request might look like:
import socket

mysock = socket.create_connection(("www.example.com", 80))
mysock.sendall(
    b"GET /huge-growing-file HTTP/1.1\r\n"
    b"Host: www.example.com\r\n"
    b"Range: bytes=XXXX-\r\n"
    b"Connection: close\r\n\r\n")
Where XXXX represents the number of bytes you've already retrieved. Then you can read the response headers and any content from the server. If the server returns a header like:
Content-Length: 0
You know you've got the entire file.
If you want to be particularly nice as an HTTP client you can look into "Connection: keep-alive". Perhaps there is a python library that does everything I have described (perhaps even urllib2 does it!) but I'm not familiar with one.
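A runnable variant of the same idea, using Python 3's http.client so the HTTP framing is handled for you. The local server here is just a stand-in for www.example.com, with a deliberately minimal Range implementation (it only parses the open-ended "bytes=N-" form, which is an assumption for the demo; real servers handle the full grammar):

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

DATA = b'0123456789' * 10  # stand-in for the huge-growing-file

class RangeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Parse "Range: bytes=N-" (only the open-ended form, for brevity).
        start = 0
        rng = self.headers.get('Range')
        if rng and rng.startswith('bytes='):
            start = int(rng[len('bytes='):].rstrip('-'))
        body = DATA[start:]
        self.send_response(206 if start else 200)
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet

server = HTTPServer(('127.0.0.1', 0), RangeHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = http.client.HTTPConnection('127.0.0.1', server.server_port)
conn.request('GET', '/huge-growing-file', headers={'Range': 'bytes=60-'})
resp = conn.getresponse()
tail = resp.read()   # only the bytes from offset 60 onwards
conn.close()
server.shutdown()
```

The 206 (Partial Content) status is how you can tell the server actually honored the Range header rather than sending the whole file.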
If I understand your question correctly, the file is not changing during download, but is updated regularly. If that is the question, rsync is the answer.
If the file is being updated continually including during download, you'll need to modify rsync or a bittorrent program. They split files into separate chunks and download or update the chunks independently. When you get to the end of the file from the first iteration, repeat to get the appended chunk; continue as necessary. With less efficiency, one could just repeatedly rsync.
Related
I'm writing a python script to parse jenkins job results. I'm using urllib2 to fetch consoleText, but the file that I receive isn't full. The code to fetch the file is:
data = urllib2.urlopen('http://<server>/job/<jobname>/<buildid>/consoleText')
lines = data.readlines()
And the number of lines I get is 2306, while the actual number of lines in the console log is 37521. I can check that by fetching the file via wget:
$ wget 'http://<server>/job/<jobname>/<buildid>/consoleText'
$ wc -l consoleText
37521
Why does urlopen not give me the full result?
UPDATE:
Using requests (as suggested by @svrist) instead of urllib2 doesn't have this problem, so I'm switching to it. My new code is:
data = requests.get('http://<server>/job/<jobname>/<buildid>/consoleText')
lines = [l for l in data.iter_lines()]
But I still have no idea why urllib2.urlopen doesn't work properly.
The Jenkins build log is returned using a chunked encoding response.
Transfer-Encoding: chunked
Based on a couple of other questions, it seems like urllib2 does not handle the entire response and as you've observed, only returns the first chunk.
I also recommend using the requests package.
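For illustration, here is a minimal decoder for a chunked body (a sketch of the wire format, not urllib2's actual internals): each chunk is a hex size line followed by that many bytes, and a zero-size chunk terminates the stream. A client that stops after the first chunk returns only a prefix of the payload, which matches the truncation you observed:

```python
def decode_chunked(body):
    """Decode an HTTP/1.1 chunked-encoded byte string into the payload."""
    out = bytearray()
    pos = 0
    while True:
        eol = body.index(b'\r\n', pos)
        # The chunk-size line is hex, optionally followed by ";extensions".
        size = int(body[pos:eol].split(b';')[0], 16)
        if size == 0:
            break  # zero-size chunk ends the stream
        start = eol + 2
        out += body[start:start + size]
        pos = start + size + 2  # skip chunk data plus its trailing CRLF
    return bytes(out)

decode_chunked(b'4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n')  # → b'Wikipedia'
```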
I'm quite new to using Python for HTTP requests, and so far I have a script
that fetches XML files from a long list of URLs (on the same server) and then extracts data from nodes with lxml.
Everything works fine, however I'm a bit concerned about the huge number of requests the host server might receive from me.
Is there a way, using requests, to send only one request to the server that will fetch all the XML from the different URLs and store them in a tar.gz file?
Here is what my script does so far (with a small sample):
IDlist = list(accession_clean)
URLlist = ['http://www.uniprot.org/uniprot/Q13111.xml', 'http://www.uniprot.org/uniprot/A2A2F0.xml', 'http://www.uniprot.org/uniprot/G5EA09.xml', 'http://www.uniprot.org/uniprot/Q8IY37.xml', 'http://www.uniprot.org/uniprot/O14545.xml', 'http://www.uniprot.org/uniprot/O00308.xml', 'http://www.uniprot.org/uniprot/Q13136.xml', 'http://www.uniprot.org/uniprot/Q86UT6.xml']
import urllib2
from lxml import etree

for id, item in zip(IDlist, URLlist):
    try:
        textfile = urllib2.urlopen(item)
    except urllib2.HTTPError:
        print 'URL {} could not be read.'.format(item)
        continue
    try:
        tree = etree.parse(textfile)
    except etree.XMLSyntaxError:
        print 'Skipping invalid XML from URL {}'.format(item)
        continue
That website offers an API, which is documented here; although it will not give you a tar.gz file, it is possible to retrieve multiple entries with a single HTTP request.
Perhaps one of the batch or query methods will work for you.
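As a sketch of what a batched query could look like: the classic UniProt query interface lets you OR several accessions together and request XML in one round trip. The parameter names below are my assumption from its older URL scheme, so check the current API documentation before relying on them:

```python
from urllib.parse import urlencode  # urllib.urlencode in Python 2

accessions = ['Q13111', 'A2A2F0', 'G5EA09', 'Q8IY37']
# Combine all accessions into one query expression.
query = ' OR '.join('accession:%s' % acc for acc in accessions)
batch_url = 'http://www.uniprot.org/uniprot/?' + urlencode(
    {'query': query, 'format': 'xml'})
```

One request for the whole batch replaces the per-ID loop, which is exactly what reduces load on the host server.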
I can get a file that has content-encoding as gzip.
So does that mean that the server is storing it as a compressed file, or is that also true for files stored as compressed zip or 7z files?
and if so (where durl is a zip file)
>>> durl = 'https://db.tt/Kq0byWzW'
>>> dresp = requests.get(durl, allow_redirects=True, stream=True)
>>> dresp.headers['content-encoding']
'gzip'
>>> r = requests.get(durl, stream=True)
>>> data = r.raw.read(decode_content=True)
but data comes out empty, while I want to extract the zip file to disk on the fly!
So first of all, durl is not a zip file, it is a Dropbox landing page. So what you are looking at is HTML, which is being sent using gzip encoding. If you were to decode the data from the raw socket using gzip, you would simply get the HTML. So the use of raw is really just hiding that you accidentally got a different file than the one you thought.
Based on https://plus.google.com/u/0/100262946444188999467/posts/VsxftxQnRam where you ask
Does anyone has any idea about writing compressed file directy to disk to decompressed state?
I take it you are really trying to fetch a zip and decompress it directly to a directory without first storing it. To do this you need to use https://docs.python.org/2/library/zipfile.html
Though at this point the problem becomes that the response from requests isn't actually seekable, which zipfile requires in order to work (one of the first things it will do is seek to the end of the file to determine how long it is).
To get around this you need to wrap the response in a file like object. Personally I would recommend using tempfile.SpooledTemporaryFile with a max size set. This way your code would switch to writing things to disk if the file was bigger than you expected.
import requests
import tempfile
import zipfile
KB = 1<<10
MB = 1<<20
url = '...' # Set url to the download link.
resp = requests.get(url, stream=True)
with tempfile.SpooledTemporaryFile(max_size=500*MB) as tmp:
    for chunk in resp.iter_content(4*KB):
        tmp.write(chunk)
    archive = zipfile.ZipFile(tmp)
    archive.extractall(path)
Same code using io.BytesIO:
import io

resp = requests.get(url, stream=True)
tmp = io.BytesIO()
for chunk in resp.iter_content(4*KB):
    tmp.write(chunk)
archive = zipfile.ZipFile(tmp)
archive.extractall(path)
You need the content from the requests response to write it.
Confirmed working:
import requests
durl = 'https://db.tt/Kq0byWzW'
dresp = requests.get(durl, allow_redirects=True, stream=True)
dresp.headers['content-encoding']
with open('test.html', 'w') as f:
    f.write(dresp.text)
You have to differentiate between content-encoding (not to be confused with transfer-encoding) and content-type.
The gist of it is that content-type is the media-type (the real file-type) of the resource you are trying to get. And content-encoding is any kind of modification applied to it before sending it to the client.
So let's assume you'd like to get a resource named "foo.txt". It will probably have a content-type of text/plain. In addition to that, the data can be modified when sent over the wire. This is the content-encoding. So, with the above example, you can have a content-type of text/plain and a content-encoding of gzip. This means that before the server sends the file out onto the wire, it will compress it using gzip on the fly. So the only bytes which traverse the net are zipped, not the raw bytes of the original file (foo.txt).
It is the job of the client to process these headers accordingly.
Now, I am not 100% sure if requests, or the underlying python libs do this but chances are they do. If not, Python ships with a default gzip library, so you could do it on your own without a problem.
With the above in mind, to respond to your question: No, having a "content-encoding" of gzip does not mean that the remote resource is a zip-file. The field containing that information is content-type (based on your question this has probably a value of application/zip or application/x-7z-compressed depending of actual compression algorithm used).
If you cannot determine the real file-type based on the content-type field (f.ex. if it is application/octet-stream), you could just save the file to disk, and open it up with a hex editor. In the case of a 7z file you should see the byte sequence 37 7a bc af 27 1c somewhere. Most likely at the beginning of the file or at EOF-112 bytes. In the case of a gzip file, it should be 1f 8b at the beginning of the file.
Given that you have gzip in the content-encoding field: If you get a 7z file, you can be certain that requests has parsed content-encoding and properly decoded it for you. If you get a gzip file, it could mean two things: either requests has not decoded anything, or the file is indeed a gzip file, as it could be a gzip file sent with the gzip encoding. That would mean it's doubly compressed. This would not make any sense, but, depending on the server, it could still happen.
You could simply try to run gunzip on the console and see what you get.
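The magic-byte check described above is also easy to do in code instead of a hex editor. A minimal sketch, using the signatures quoted above plus zip's well-known "PK" prefix:

```python
import gzip

def sniff(first_bytes):
    """Guess a compression format from a file's leading magic bytes."""
    if first_bytes.startswith(b'\x1f\x8b'):
        return 'gzip'
    if first_bytes.startswith(b'7z\xbc\xaf\x27\x1c'):  # 37 7a bc af 27 1c
        return '7z'
    if first_bytes.startswith(b'PK'):
        return 'zip'
    return 'unknown'

sniff(gzip.compress(b'payload')[:2])  # → 'gzip'
```

Reading just the first few bytes of the saved file and passing them to a function like this avoids loading the whole download into memory.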
I am looking for a simple way to save a csv file originating from a published Google Sheets document? Since it's published, it's accessible through a direct link (modified on purpose in the example below).
All my browsers will prompt me to save the csv file as soon as I launch the link.
Neither:
import urllib.request

DOC_URL = 'https://docs.google.com/spreadsheet/ccc?key=0AoOWveO-dNo5dFNrWThhYmdYW9UT1lQQkE&output=csv'
f = urllib.request.urlopen(DOC_URL)
cont = f.read(SIZE)
f.close()
cont = str(cont, 'utf-8')
print(cont)
, nor:
req = urllib.request.Request(DOC_URL)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.13 (KHTML, like Gecko) Chrome/24.0.1284.0 Safari/537.13')
f = urllib.request.urlopen(req)
print(f.read().decode('utf-8'))
prints anything but HTML content.
(Tried the 2nd version after reading this other post: Download google docs public spreadsheet to csv with python .)
Any idea what I am doing wrong? I am logged out of my Google account, if that's worth anything, but this works from any browser I've tried. As far as I understood, the Google Docs API has not yet been ported to Python 3, and given the "toy" magnitude of my little project for personal use, it would not even make much sense to use it from the get-go if I can circumvent it.
In the 2nd attempt, I kept the 'User-Agent' header, thinking that maybe requests identified as coming from scripts (because no identification info is present) might be ignored, but it didn't make a difference.
While the requests library is the gold standard for HTTP requests from Python, this style of download is (while not deprecated yet) not likely to last, specifically referring to the use of links, managing cookies & redirects, etc. One of the reasons for not preferring links is that it's less secure and generally such access should require authorization. Instead, the currently accepted way of exporting Google Sheets as CSV is by using the Google Drive API.
So why the Drive API? Isn't this supposed to be something for the Sheets API instead? Well, the Sheets API is for spreadsheet-oriented functionality, i.e., data formatting, column resize, creating charts, cell validation, etc., while the Drive API is for file-oriented functionality, i.e., import/export, copy, rename, etc.
Below is a complete cmd-line solution. (If you don't do Python, you can use it as pseudocode and pick any language supported by the Google APIs Client Libraries.) For the code snippet, assume the most current Sheet named inventory (older files with that name are ignored) and DRIVE is the API service endpoint:
FILENAME = 'inventory'
SRC_MIMETYPE = 'application/vnd.google-apps.spreadsheet'
DST_MIMETYPE = 'text/csv'
# query for latest file named FILENAME
files = DRIVE.files().list(
        q='name="%s" and mimeType="%s"' % (FILENAME, SRC_MIMETYPE),
        orderBy='modifiedTime desc,name').execute().get('files', [])

# if found, export Sheets file as CSV
if files:
    fn = '%s.csv' % os.path.splitext(files[0]['name'].replace(' ', '_'))[0]
    print('Exporting "%s" as "%s"... ' % (files[0]['name'], fn), end='')
    data = DRIVE.files().export(
        fileId=files[0]['id'], mimeType=DST_MIMETYPE).execute()
    # if non-empty file
    if data:
        with open(fn, 'wb') as f:
            f.write(data)
        print('DONE')
If your Sheet is large, you may have to export it in chunks -- see this page on how to do that. If you're generally new to Google APIs, I have a (somewhat dated but) user-friendly intro video for you. (There are 2 videos after that which may be useful too.)
Google responds to the initial request with a series of cookie-setting 302 redirects. If you don't store and resubmit the cookies between requests, it redirects you to the login page.
So, the problem is not with the User-Agent header, it's the fact that by default, urllib.request.urlopen doesn't store cookies, but it will follow the HTTP 302 redirects.
The following code works just fine on a public spreadsheet available at the location specified by DOC_URL:
>>> from http.cookiejar import CookieJar
>>> from urllib.request import build_opener, HTTPCookieProcessor
>>> opener = build_opener(HTTPCookieProcessor(CookieJar()))
>>> resp = opener.open(DOC_URL)
>>> # should really parse resp.getheader('content-type') for encoding.
>>> csv_content = resp.read().decode('utf-8')
Having shown you how to do it in vanilla python, I'll now say that the Right Way™ to go about this is to use the most-excellent requests library. It is extremely well documented and makes these sorts of tasks incredibly pleasant to complete.
For instance, to get the same csv_content as above using the requests library is as simple as:
>>> import requests
>>> csv_content = requests.get(DOC_URL).text
That single line expresses your intent more clearly. It's easier to write and easier to read. Do yourself - and anyone else who shares your codebase - a favor and just use requests.
I have a program where I need to open many webpages and download information in them. The information, however, is in the middle of the page, and it takes a long time to get to it. Is there a way to have urllib only retrieve x lines? Or, if nothing else, don't load the information afterwards?
I'm using Python 2.7.1 on Mac OS 10.8.2.
The returned object is a file-like object, and you can use .readline() to only read a partial response:
resp = urllib.urlopen(url)
for i in range(10):
    line = resp.readline()
would read only 10 lines, for example. Note that this won't guarantee a faster response.
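The same partial read can also be written with itertools.islice, since the response object is iterable line by line. A small sketch, using io.BytesIO as a stand-in for the HTTP response:

```python
import io
from itertools import islice

def head_lines(fileobj, n):
    # Read at most n lines from a file-like object such as a urlopen response.
    return list(islice(fileobj, n))

# io.BytesIO stands in for the HTTP response body here.
fake_resp = io.BytesIO(b'line1\nline2\nline3\nline4\n')
head_lines(fake_resp, 2)  # → [b'line1\n', b'line2\n']
```

As with the readline loop, the rest of the body may still be sent by the server; you are only saving the work of reading and parsing it.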