Is there a way to consume the stream of data from urllib.request.urlopen(req) so that it can be processed in chunks?
I am on a machine with limited memory, and the API response can be larger than the available RAM, which causes out-of-memory exceptions.
I am currently using the following call:
resp = json.loads(urllib.request.urlopen(req).read().decode())
Try processing the response line by line, which may lower the memory usage:
req = urllib.request.urlopen(req)
data = ''
for line in req:
    line = line.decode()
    data += line
data = json.loads(data)
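If the intermediate string is what pushes you over the memory limit, a minimal alternative sketch (assuming req is the same Request object as above) is to let json.load read from the response directly, so you never hold the full raw text in your own string:
import io
import json
import urllib.request

resp = urllib.request.urlopen(req)
# json.load accepts any text-mode file-like object, so wrap the binary
# response in a TextIOWrapper instead of building one big string first
data = json.load(io.TextIOWrapper(resp, encoding='utf-8'))
Note that the fully parsed object still has to fit in memory; if even that is too large, an incremental JSON parser would be needed.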
I am trying to download a dataset from https://datasets.imdbws.com/title.principals.tsv.gz, decompress the contents in my Python code itself, and write the resulting file(s) to disk.
To do so I am using the following code snippet.
results = requests.get(config[sourceFiles]['url'])
with open(config[sourceFiles]['downloadLocation']+config[sourceFiles]['downloadFileName'], 'wb') as f_out:
    print(config[sourceFiles]['downloadFileName'] + " starting download")
    f_out.write(gzip.decompress(results.content))
    print(config[sourceFiles]['downloadFileName'] + " downloaded successfully")
This code works fine for most gzip files; however, for larger files it gives the following error message.
File "C:\Users\****\AppData\Local\Programs\Python\Python37-32\lib\gzip.py", line 532, in decompress
return f.read()
File "C:\Users\****\AppData\Local\Programs\Python\Python37-32\lib\gzip.py", line 276, in read
return self._buffer.read(size)
File "C:\Users\****\AppData\Local\Programs\Python\Python37-32\lib\gzip.py", line 471, in read
uncompress = self._decompressor.decompress(buf, size)
MemoryError
Is there a way to accomplish this without first downloading the compressed file to disk and then decompressing it to get the actual data?
You can use a streaming request coupled with zlib:
import zlib
import requests
url = 'https://datasets.imdbws.com/title.principals.tsv.gz'
result = requests.get(url, stream=True)
f_out = open("result.txt", "wb")
chunk_size = 1024 * 1024
d = zlib.decompressobj(zlib.MAX_WBITS|32)
for chunk in result.iter_content(chunk_size):
    buffer = d.decompress(chunk)
    f_out.write(buffer)
buffer = d.flush()
f_out.write(buffer)
f_out.close()
This snippet reads the data chunk by chunk and feeds it to zlib which can handle data streams.
Depending on your connection speed and CPU/disk performance, you can experiment with different chunk sizes.
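For comparison (this is my own sketch, not part of the answer above), the same streaming idea can be expressed with gzip.GzipFile wrapping the raw response object, provided you keep requests from transparently decoding the body:
import gzip
import shutil
import requests

url = 'https://datasets.imdbws.com/title.principals.tsv.gz'
with requests.get(url, stream=True) as r:
    r.raw.decode_content = False  # make sure we see the raw gzip bytes
    with gzip.open(r.raw, 'rb') as gz, open('result.txt', 'wb') as f_out:
        # copy in fixed-size pieces so the whole file never sits in memory
        shutil.copyfileobj(gz, f_out, length=1024 * 1024)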
I am trying to download an original image (PNG format) by URL, convert it on the fly (without saving it to disk), and save it as a JPG.
The code is as follows:
import os
import io
import requests
from PIL import Image
...
r = requests.get(img_url, stream=True)
if r.status_code == 200:
    i = Image.open(io.BytesIO(r.content))
    i.save(os.path.join(out_dir, 'image.jpg'), quality=85)
It works, but when I try to monitor the download progress (for a future progress bar) with r.iter_content() like this:
r = requests.get(img_url, stream=True)
if r.status_code == 200:
    for chunk in r.iter_content():
        print(len(chunk))
    i = Image.open(io.BytesIO(r.content))
    i.save(os.path.join(out_dir, 'image.jpg'), quality=85)
I get this error:
Traceback (most recent call last):
File "E:/GitHub/geoportal/quicklookScrape/temp.py", line 37, in <module>
i = Image.open(io.BytesIO(r.content))
File "C:\Python35\lib\site-packages\requests\models.py", line 736, in content
'The content for this response was already consumed')
RuntimeError: The content for this response was already consumed
So is it possible to monitor the download progress and still get the data itself afterwards?
When using r.iter_content(), you need to buffer the results somewhere. Unfortunately, I can't find any examples where the contents get appended to an object in memory; usually, iter_content is used when a file can't or shouldn't be loaded entirely into memory at once. However, you can buffer it using a tempfile.SpooledTemporaryFile as described in this answer: https://stackoverflow.com/a/18550652/4527093. This will prevent saving the image to disk (unless the image is larger than the specified max_size). Then, you can create the Image from the tempfile.
import os
import io
import requests
from PIL import Image
import tempfile
buffer = tempfile.SpooledTemporaryFile(max_size=1e9)
r = requests.get(img_url, stream=True)
if r.status_code == 200:
    downloaded = 0
    filesize = int(r.headers['content-length'])
    for chunk in r.iter_content(chunk_size=1024):
        downloaded += len(chunk)
        buffer.write(chunk)
        print(downloaded/filesize)
    buffer.seek(0)
    i = Image.open(io.BytesIO(buffer.read()))
    i.save(os.path.join(out_dir, 'image.jpg'), quality=85)
buffer.close()
Edited to include chunk_size, which will limit the updates to occurring every 1kb instead of every byte.
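As a small follow-up (my addition, not part of the original answer): Pillow's Image.open also accepts a file-like object directly, so the extra copy through io.BytesIO can be skipped by replacing the last few lines of the snippet above with something like:
buffer.seek(0)
i = Image.open(buffer)  # read straight from the spooled temporary file
i.save(os.path.join(out_dir, 'image.jpg'), quality=85)
buffer.close()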
I have a data export job that reads data from a REST endpoint and then saves the data in a temporary compressed file before being written to S3. This was working for smaller payloads:
import gzip
import urllib2
# Fails when writing too much data at once
def get_data(url, params, fileobj):
    request = urllib2.urlopen(url, params)
    event_data = request.read()
    with gzip.open(fileobj.name, 'wb') as f:
        f.write(event_data)
However, as the data size increased I got an error that seems to indicate I'm writing too much data at once:
File "/usr/lib64/python2.7/gzip.py", line 241, in write
self.fileobj.write(self.compress.compress(data))
OverflowError: size does not fit in an int
I tried modifying the code to read from the REST endpoint line by line and write each line to the file, but this was incredibly slow, probably because the endpoint isn't set up to handle that.
# Incredibly slow
def get_data(url, params, fileobj):
    request = urllib2.urlopen(url, params)
    with gzip.open(fileobj.name, 'wb') as f:
        for line in request:
            f.write(line)
Is there a more efficient way to do this, such as by reading the entire payload at once, like in the first example, but then efficiently reading line-by-line from the data now residing in memory?
Turns out this is what StringIO is for. By turning my payload into a StringIO object I was able to read from it line-by-line and write to a gzipped file without any errors.
from StringIO import StringIO
def get_data(url, params, fileobj):
    request = urllib2.urlopen(url, params)
    event_data = StringIO(request.read())
    with gzip.open(fileobj.name, 'wb') as f:
        for line in event_data:
            f.write(line)
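If the output doesn't need to be written line by line, a minimal variation (my sketch, using the same Python 2 names as above) reads the in-memory buffer in fixed-size chunks, which keeps each individual gzip write small:
import gzip
import urllib2
from StringIO import StringIO

def get_data(url, params, fileobj, chunk_size=1024 * 1024):
    request = urllib2.urlopen(url, params)
    event_data = StringIO(request.read())
    with gzip.open(fileobj.name, 'wb') as f:
        while True:
            chunk = event_data.read(chunk_size)
            if not chunk:
                break
            # each write is at most chunk_size bytes, well under the
            # 32-bit int limit that triggered the OverflowError
            f.write(chunk)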
If I make a request for a file and specify encoding of gzip, how do I handle that?
Normally when I have a large file I do the following:
while True:
    chunk = resp.read(CHUNK)
    if not chunk:
        break
    writer.write(chunk)
    writer.flush()
where CHUNK is some size in bytes, writer is a file object returned by open(), and resp is the response generated from a urllib request.
It's pretty simple most of the time: when the response header lists 'gzip' as the content encoding, I would do the following:
decomp = zlib.decompressobj(16+zlib.MAX_WBITS)
data = decomp.decompress(resp.read())
writer.write(data)
writer.flush()
or this:
f = gzip.GzipFile(fileobj=buf)
writer.write(f.read())
where buf is a BytesIO().
If I try to decompress the gzip response though, I am getting issues:
while True:
    chunk = resp.read(CHUNK)
    if not chunk:
        break
    decomp = zlib.decompressobj(16+zlib.MAX_WBITS)
    data = decomp.decompress(chunk)
    writer.write(data)
    writer.flush()
Is there a way I can decompress the gzip data as it comes down in little chunks? Or do I need to write the whole file to disk, decompress it, and then move it to the final file name? Part of the issue, since I'm using 32-bit Python, is that I can get out-of-memory errors.
Thank you
I think I found a solution that I wish to share.
import zlib

def _chunk(response, size=4096):
    """ downloads a web response in pieces """
    method = response.headers.get("content-encoding")
    if method == "gzip":
        d = zlib.decompressobj(16+zlib.MAX_WBITS)
        b = response.read(size)
        while b:
            data = d.decompress(b)
            yield data
            b = response.read(size)
        del data
    else:
        while True:
            chunk = response.read(size)
            if not chunk:
                break
            yield chunk
If anyone has a better solution, please add to it. Basically my error was where I created the zlib.decompressobj(): it needs to be created once before the read loop, not on every chunk.
This seems to work in both Python 2 and 3 as well, which is a plus.
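For completeness, a hypothetical usage sketch of the generator above (resp and CHUNK are the response object and chunk size from the question; 'output.bin' is just a placeholder path):
CHUNK = 4096
with open('output.bin', 'wb') as writer:  # 'output.bin' is a placeholder
    for piece in _chunk(resp, size=CHUNK):
        writer.write(piece)
        writer.flush()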
I wrote a Python script which I am using to download a large number of video files (50-400 MB each) from an HTTP server. It has worked well so far on long lists of downloads, but every so often it hits a memory error.
The machine has about 1 GB of RAM free, but I don't think it's ever maxed out on RAM while running this script.
I've monitored the memory usage in Task Manager and perfmon, and it always behaves the same from what I've seen: it slowly increases during the download, then returns to the normal level after the download finishes (there are no small leaks that creep up or anything like that).
The way the download behaves is that it creates the file, which remains at 0 KB until the download finishes (or the program crashes), then it writes the whole file at once and closes it.
for i in range(len(urls)):
    if os.path.exists(folderName + '/' + filenames[i] + '.mov'):
        print 'File exists, continuing.'
        continue

    # Request the download page
    req = urllib2.Request(urls[i], headers = headers)
    sock = urllib2.urlopen(req)
    responseHeaders = sock.headers
    body = sock.read()
    sock.close()

    # Search the page for the download URL
    tmp = body.find('/getfile/')
    downloadSuffix = body[tmp:body.find('"', tmp)]
    downloadUrl = domain + downloadSuffix

    req = urllib2.Request(downloadUrl, headers = headers)
    print '%s Downloading %s, file %i of %i' \
        % (time.ctime(), filenames[i], i+1, len(urls))
    f = urllib2.urlopen(req)

    # Open our local file for writing, 'b' for binary file mode
    video_file = open(folderName + '/' + filenames[i] + '.mov', 'wb')

    # Write the downloaded data to the local file
    video_file.write(f.read()) ##### MemoryError: out of memory #####
    video_file.close()
    print '%s Download complete!' % (time.ctime())

    # Free up memory, in hopes of preventing memory errors
    del f
    del video_file
Here is the stack trace:
File "downloadVideos.py", line 159, in <module>
main()
File "downloadVideos.py", line 136, in main
video_file.write(f.read())
File "c:\python27\lib\socket.py", line 358, in read
buf.write(data)
MemoryError: out of memory
Your problem is here: f.read(). That line attempts to download the entire file into memory. Instead of that, read in chunks (chunk = f.read(4096)) and save the pieces to a temporary file.
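A rough sketch of that fix, reusing the names from the question (treat it as an outline, not a tested drop-in):
CHUNK = 4096
f = urllib2.urlopen(req)
video_file = open(folderName + '/' + filenames[i] + '.mov', 'wb')
while True:
    chunk = f.read(CHUNK)
    if not chunk:
        break
    video_file.write(chunk)  # only CHUNK bytes are held in memory at a time
video_file.close()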