downloading a large file in chunks with gzip encoding (Python 3.4)

If I make a request for a file and specify encoding of gzip, how do I handle that?
Normally when I have a large file I do the following:
while True:
    chunk = resp.read(CHUNK)
    if not chunk: break
    writer.write(chunk)
    writer.flush()
where CHUNK is some size in bytes, writer is a file object returned by open(), and resp is the response generated from a urllib request.
So it's pretty simple most of the time. When the response header reports 'gzip' as the content encoding, I would do the following:
decomp = zlib.decompressobj(16+zlib.MAX_WBITS)
data = decomp.decompress(resp.read())
writer.write(data)
writer.flush()
or this:
f = gzip.GzipFile(fileobj=buf)
writer.write(f.read())
where buf is a BytesIO() object.
If I try to decompress the gzip response in chunks, though, I run into issues:
while True:
    chunk = resp.read(CHUNK)
    if not chunk: break
    decomp = zlib.decompressobj(16+zlib.MAX_WBITS)
    data = decomp.decompress(chunk)
    writer.write(data)
    writer.flush()
Is there a way I can decompress the gzip data as it comes down in small chunks, or do I need to write the whole file to disk, decompress it, and then move it to the final file name? Part of the issue is that I'm on 32-bit Python, so I can run into out-of-memory errors.
Thank you

I think I found a solution that I wish to share.
def _chunk(response, size=4096):
    """ downloads a web response in pieces """
    method = response.headers.get("content-encoding")
    if method == "gzip":
        d = zlib.decompressobj(16+zlib.MAX_WBITS)
        b = response.read(size)
        while b:
            data = d.decompress(b)
            yield data
            b = response.read(size)
            del data
    else:
        while True:
            chunk = response.read(size)
            if not chunk: break
            yield chunk
If anyone has a better solution, please add to it. Basically my error was where I created the zlib.decompressobj(): it has to be created once, before the loop, not on every iteration.
This seems to work in both Python 2 and 3 as well, which is a plus.
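For context, here is a minimal usage sketch of the generator above (the URL and output file name are just placeholders, not from the original question):

import urllib.request
import zlib

req = urllib.request.Request("http://example.com/big.file",
                             headers={"Accept-Encoding": "gzip"})
resp = urllib.request.urlopen(req)

# stream the (possibly gzip-encoded) response to disk piece by piece,
# so the whole payload never has to fit in memory at once
with open("big.file", "wb") as writer:
    for piece in _chunk(resp, size=4096):
        writer.write(piece)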

Related

How to save an image sent via Flask send_file

I have this code for the server:
@app.route('/get', methods=['GET'])
def get():
    return send_file("token.jpg", attachment_filename="token.jpg", mimetype='image/jpg')
and this code for getting the response:
r = requests.get(url + '/get')
I need to save the file from the response to the hard drive, but I can't use r.files. What do I need to do in this situation?
Assuming the GET request is valid, you can use Python's built-in function open to open a file in binary mode and write the returned content to disk. Example below.
file_content = requests.get('http://yoururl/get')
save_file = open("sample_image.png", "wb")
save_file.write(file_content.content)
save_file.close()
As you can see, to write the image to disk, we use open, and write the returned content to 'sample_image.png'. Since your server-side code seems to be returning only one file, the example above should work for you.
You can set the stream parameter and extract the filename from the HTTP headers. Then the raw data from the undecoded body can be read and saved chunk by chunk.
import os
import re
import requests

resp = requests.get('http://127.0.0.1:5000/get', stream=True)
name = re.findall('filename=(.+)', resp.headers['Content-Disposition'])[0]
dest = os.path.join(os.path.expanduser('~'), name)
with open(dest, 'wb') as fp:
    while True:
        chunk = resp.raw.read(1024)
        if not chunk: break
        fp.write(chunk)
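For what it's worth, requests can also do the chunk loop for you with iter_content; a roughly equivalent sketch (the fixed output name here is just an assumption):

import requests

resp = requests.get('http://127.0.0.1:5000/get', stream=True)
resp.raise_for_status()
with open('token.jpg', 'wb') as fp:
    # iter_content yields the body in fixed-size pieces,
    # so the whole image never sits in memory at once
    for chunk in resp.iter_content(chunk_size=1024):
        fp.write(chunk)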

python sockets send file function explanation

I am trying to learn Python socket programming (networks) and I have a function that sends a file, but I am not fully understanding each line of it (I can't get my head around it). Could someone please explain line by line what it is doing? I am also unsure why it needs the length of the data to send the file. Also, if you can see any improvements to this function, please share them. Thanks.
def send_file(socket, filename):
    with open(filename, "rb") as x:
        data = x.read()
    length_data = len(data)
    try:
        socket.sendall(length_data.to_bytes(20, 'big'))
        socket.sendall(data)
    except Exception as error:
        print("Error", error)
This protocol is written to first send the size of the file and then its data. The size is 20 octets (bytes) serialized as "big endian", meaning the most significant byte is sent first. Suppose you had a 1M file,
>>> val = 1000000
>>> val.to_bytes(20, 'big')
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0fB@'
The program sends those 20 bytes and then the file payload itself. The receiver would read the 20 octets, convert that back into a file size and would know exactly how many more octets to receive for the file. TCP is a streaming protocol. It has no idea of message boundaries. Protocols need some way of telling the other side how much data makes up a message and this is a common way to do it.
As an aside, this code has the serious problem that it reads the entire file in one go; if the file were huge, it could crash.
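As an illustration (not part of the original code), a minimal sketch of a sender that keeps the same 20-byte length prefix but reads and sends the file in pieces:

import os

def send_file_chunked(skt, filename, chunk_size=4096):
    # send the 20-byte big-endian length prefix first
    file_size = os.path.getsize(filename)
    skt.sendall(file_size.to_bytes(20, 'big'))
    # then stream the payload from disk in fixed-size pieces
    with open(filename, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            skt.sendall(chunk)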
A receiver would look something like the following. This is a rudimentary example:
import io

def recv_exact(skt, count):
    buf = io.BytesIO()
    while count:
        data = skt.recv(count)
        if not data:
            return b""  # socket closed
        buf.write(data)
        count -= len(data)
    return buf.getvalue()

def recv_file(skt):
    data = recv_exact(skt, 20)
    if not data:
        return b""
    file_size = int.from_bytes(data, "big")
    file_bytes = recv_exact(skt, file_size)
    return file_bytes
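A brief usage sketch on the receiving side (here conn is assumed to be an already-accepted socket, and the output name is arbitrary):

# receive one length-prefixed file and save it to disk
data = recv_file(conn)
with open("received.bin", "wb") as out:
    out.write(data)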

Immediate write JSON API response to file with Python Requests

I am trying to retrieve data from an API and immediately write the JSON response directly to a file, without storing any part of the response in memory. The reason for this requirement is that I'm executing this script on an AWS Linux EC2 instance that only has 2GB of memory; if I try to hold everything in memory and then write the responses to a file, the process will fail due to running out of memory.
I've tried using f.write() as well as sys.stdout.write(), but both of these approaches seemed to only write the file after all the queries were executed. While this worked with my small example, it didn't work when dealing with my actual data.
The problem with both approaches below is that the file doesn't populate until the loop is complete. This will not work with my actual process, as the machine doesn't have enough memory to hold the all the responses in memory.
How can I adapt either of the approaches below, or come up with something new, to write data received from the API immediately to a file without saving anything in memory?
Note: I'm using Python 3.7 but happy to update if there is something that would make this easier.
My Approach 1
# script1.py
import requests
import json

with open('data.json', 'w') as f:
    for i in range(0, 100):
        r = requests.get("https://httpbin.org/uuid")
        data = r.json()
        f.write(json.dumps(data) + "\n")
f.close()
My Approach 2
# script2.py
import requests
import json
import sys

for i in range(0, 100):
    r = requests.get("https://httpbin.org/uuid")
    data = r.json()
    sys.stdout.write(json.dumps(data))
    sys.stdout.write("\n")
With approach 2, I tried using the > to redirect the output to a file:
script2.py > data.json
You can use response.iter_content to download the content in chunks. For example:
import requests

url = 'https://httpbin.org/uuid'
with requests.get(url, stream=True) as r:
    r.raise_for_status()
    with open('data.json', 'wb') as f_out:
        for chunk in r.iter_content(chunk_size=8192):
            f_out.write(chunk)
Saves data.json with content:
{
  "uuid": "991a5843-35ca-47b3-81d3-258a6d4ce582"
}
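Adapting that to the loop in the question, a sketch (same endpoint, arbitrary chunk size) that streams each of the 100 responses straight to the file as it arrives and flushes after every response:

import requests

url = 'https://httpbin.org/uuid'
with open('data.json', 'wb') as f_out:
    for i in range(0, 100):
        with requests.get(url, stream=True) as r:
            r.raise_for_status()
            # write each response body as it is received
            for chunk in r.iter_content(chunk_size=8192):
                f_out.write(chunk)
        f_out.write(b"\n")
        f_out.flush()  # push buffered bytes to disk after each response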

Sending back a file stream from GRPC Python Server

I have a service that needs to return a file stream to the calling client, so I have created this proto file:
service Sample {
    rpc getSomething(Request) returns (stream Response) {}
}

message Request {
}

message Response {
    bytes data = 1;
}
When the server receives this, it needs to read some source.txt file and then write it back to the client as a byte stream. I would just like to ask: is this the proper way to do this in a Python gRPC server?
file_name = "source.txt"
with open(file_name, 'r') as content_file:
    content = content_file.read()
response.data = content.encode()
yield response
I cannot find any examples related to this.
That looks mostly correct, though it's hard to be sure since you haven't shared with us all of your service-side code. A few tweaks I'd suggest would be (1) reading the file as binary content in the first place, (2) exiting the with statement as early as possible, (3) constructing the response message only after you've constructed the value of its data field, and (4) making a module-scope module-private constant out of the file name. Something like:
with open(_CONTENT_FILE_NAME, 'rb') as content_file:
    content = content_file.read()
yield my_generated_module_pb2.Response(data=content)
What do you think?
One option would be to lazily read in the binary and yield each chunk. Note, this is untested code:
def read_bytes(file_, num_bytes):
    # yield the file in num_bytes-sized pieces; the final piece may be shorter
    while True:
        data = file_.read(num_bytes)
        if not data:
            break
        yield data

class ResponseStreamer(Sample_pb2_grpc.SampleServicer):
    def getSomething(self, request, context):
        with open('test.bin', 'rb') as f:
            for rec in read_bytes(f, 4):
                yield Sample_pb2.Response(data=rec)
The downside is that the file stays open for as long as the stream is open.
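For completeness, a sketch of how a client might consume that stream and reassemble the file (the channel setup, stub name, and output path are assumptions, not from the original answers):

# hypothetical client: iterate the streamed Response messages and
# append each bytes chunk to a local file
stub = Sample_pb2_grpc.SampleStub(channel)
with open('downloaded.bin', 'wb') as out:
    for response in stub.getSomething(Sample_pb2.Request()):
        out.write(response.data)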

Python: Download CSV file, check return code?

I am downloading multiple CSV files from a website using Python. I would like to be able to check the response code on each request.
I know how to download the file using wget, but not how to check the response code:
os.system('wget http://example.com/test.csv')
I've seen a lot of people suggesting using requests, but I'm not sure that's quite right for my use case of saving CSV files.
r = requests.get('http://example.com/test.csv')
r.status_code # 200
# Pipe response into a CSV file... hm, seems messy?
What's the neatest way to do this?
You can use the stream argument - along with iter_content() it's possible to stream the response contents right into a file (docs):
import requests

r = None
try:
    r = requests.get('http://example.com/test.csv', stream=True)
    # open in binary mode, since iter_content() yields bytes
    with open('test.csv', 'wb') as f:
        # iter_content() defaults to 1-byte chunks; pass chunk_size for larger reads
        for data in r.iter_content():
            f.write(data)
finally:
    if r is not None:
        r.close()
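Since the question is specifically about checking the response code, here is a small sketch (URL and file name are placeholders) that combines the status check with the streaming download:

import requests

r = requests.get('http://example.com/test.csv', stream=True)
try:
    if r.status_code == 200:
        with open('test.csv', 'wb') as f:
            for data in r.iter_content(chunk_size=8192):
                f.write(data)
    else:
        print('Download failed with status', r.status_code)
finally:
    r.close()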
