Reading RestResponse in Chunks - python

To avoid a MemoryError in Python, I am trying to read a file in chunks. I've been searching for half a day on how to read chunks from a RESTResponse, but to no avail.
The source is a file-like object from the Dropbox SDK for Python.
Here's my attempt:
import dropbox
from filechunkio import FileChunkIO
import math

file_and_metadata = dropbox_client.metadata(path)
hq_file = dropbox_client.get_file(file_and_metadata['path'])
source_size = file_and_metadata['bytes']
chunk_size = 4194304
chunk_count = int(math.ceil(source_size / chunk_size))

for i in range(chunk_count + 1):
    offset = chunk_size * i
    bytes = min(chunk_size, source_size - offset)
    with FileChunkIO(hq_file, 'r', offset=offset,
                     bytes=bytes) as fp:
        with open('tmp/testtest123.mp4', 'wb') as f:
            f.write(fp)
            f.flush()
This results in "TypeError: coercing to Unicode: need string or buffer, RESTResponse found"
Any clues or solutions would be greatly appreciated.

Without knowing anything about FileChunkIO, or even knowing where your code is raising an exception, it's hard to be sure, but my guess is that it needs a real file-like object. Or maybe it does something silly, like checking the type so it can decide whether you're looking to chunk up a string or chunk up a file.
Anyway, according to the docs, RESTResponse isn't a full file-like object, but it implements read and close. And you can easily chunk something that implements read without any fancy wrappers. File-like objects' read methods are guaranteed to return b'' when you get to EOF, and can return fewer bytes than you asked for, so you don't need to guess how many times you need to read and do a short read at the end. Just do this:
chunk_size = 4194304
with open('tmp/testtest123.mp4', 'wb') as f:
    while True:
        buf = hq_file.read(chunk_size)
        if not buf:
            break
        f.write(buf)
(Notice that I moved the open outside of the loop. Otherwise, for each chunk, you're going to open and empty out the file, then write the next chunk, so at the end you'll end up with just the last one.)
If you want a chunking wrapper, there's a perfectly good builtin function, iter, that can do it for you:
chunk_size = 4194304
chunks = iter(lambda: hq_file.read(chunk_size), '')
with open('tmp/testtest123.mp4', 'wb') as f:
    f.writelines(chunks)
Note that the exact same code works in Python 3.x if you change that '' to b'', but that breaks Python 2.5.
This might be a bit of an abuse of writelines, because we're writing an iterable of strings that aren't actually lines. If you don't like it, an explicit loop is just as simple and not much less concise.
I usually write that as partial(hq_file.read, chunk_size) rather than lambda: hq_file.read(chunk_size), but it's really a matter of preference; read the docs on partial and you should be able to understand why they ultimately have the same effect, and decide which one you prefer.
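For reference, here's a sketch of both spellings side by side (assuming Python 3, hence the b'' sentinel; hq_file is the RESTResponse from above):
from functools import partial

chunk_size = 4194304

# partial(hq_file.read, chunk_size) builds a callable that reads one chunk
# per call, exactly like lambda: hq_file.read(chunk_size).
chunks = iter(partial(hq_file.read, chunk_size), b'')

# The explicit-loop equivalent of f.writelines(chunks):
with open('tmp/testtest123.mp4', 'wb') as f:
    for chunk in chunks:
        f.write(chunk)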

Related

Can someone please explain how this function works

I found this function when looking up how to count lines in a file, but I have no idea how it works.
def _count_generator(reader):
    b = reader(1024 * 1024)
    while b:
        yield b
        b = reader(1024 * 1024)

with open('test.txt', 'rb') as fp:
    c_generator = _count_generator(fp.raw.read)
    # count each new line
    count = sum(buffer.count(b'\n') for buffer in c_generator)
    print('total lines', count + 1)
I understand that it's reading the file as a bytes object, but I don't understand what reader(1024 * 1024) does or how exactly the whole thing works.
Any help is appreciated.
Thanks.
open() returns a file object. Since it's opening the file with rb (read-binary), it returns a io.BufferedReader. The underlying raw buffer can be retrieved via the .raw property, which is a RawIOBase - its method, RawIOBase.read, is passed to _count_generator.
Since _count_generator is a generator, it is an iterable. Its purpose is to read 1 MB of data from the file and yield it back to the caller on every iteration until the file is over: once the file is exhausted, reader() returns an empty bytes object, which stops the loop.
The caller counts the newlines in each 1 MB chunk via sum() over a generator expression, chunk after chunk, until the file is exhausted.
tl;dr You are reading the file 1 MB at a time and summing its newlines. Why? Because the file is more than likely too large to be held in memory all at once.
Let's start with the argument to the function. fp.raw.read is the read method of the raw reader of the binary file fp. The read method accepts an integer that tells it how many bytes to read. It returns an empty bytes on EOF.
The function itself is a generator. It lazily calls read to get up to 1MB of data at a time. The chunks are not read until requested by the generator in sum, which counts newlines. Raw read with a positive integer argument will only make one call to the underlying OS, so 1MB is just a hint in this case: most of the time it will read one disk block, usually around 4KB or so.
This program has two immediately apparent flaws if you take the time to read the documentation.
raw is not guaranteed to exist in every implementation of python:
This is not part of the BufferedIOBase API and may not exist on some implementations.
read in non-blocking mode can return None when no data is available but EOF has not been reached. Only empty bytes indicates EOF, so the while loop should be while b != b'':.
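For illustration only (not the original snippet), here is one way those two caveats could be addressed, dropping .raw and treating only empty bytes as EOF:
def _count_generator(reader, block_size=1024 * 1024):
    # Yield successive chunks until the reader signals EOF with b''.
    while True:
        b = reader(block_size)
        if b == b'':        # only empty bytes means EOF
            return
        if b is None:       # non-blocking read with no data available yet
            continue
        yield b

with open('test.txt', 'rb') as fp:
    # Use the buffered fp.read instead of fp.raw.read, since .raw may not exist.
    count = sum(chunk.count(b'\n') for chunk in _count_generator(fp.read))
    print('total lines', count + 1)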

How can I load a file with buffers in python?

hope you are having a great day!
In my recent ventures with Python 3.8.5 I have come across a dilemma I must say...
Being that I am a fairly new programmer I am afraid that I don't have the technical knowledge to load a single (BIG) file into the program.
To make my question much more understandable, let's look at this below:
Let's say that there is a file on my system called "File.mp4" or "File.txt" (1GB in size);
I want to load this file into my program using the open function as rb;
I declared a buffer size of 1024;
This is the part I don't know how to solve
I load 1024 bytes' worth of data into the program
I do whatever I need to do with it
I then load another 1024 bytes in the place of the old buffer
Rinse and repeat until the whole file has been run through.
I looked at this question but either it is not good for my case or I just don't know how to implement it -> link to the question
This is the whole code you requested:
BUFFER = 1024

with open('file.txt', 'rb') as f:
    while (chunk := f.read(BUFFER)) != '':
        print(list(chunk))
You can use buffered input from io with bytearray:
import io

buf = bytearray(1024)
with io.open(filename, 'rb') as fp:
    while True:
        size = fp.readinto(buf)
        if not size:
            break
        # do things with buf, keeping in mind that only the first `size` bytes are valid
This is one of the situations that python 3.8's new walrus operator - which both assigns a value to a variable, and returns the value that it just assigned - is really good for. You can use file.read(size) to read in 1024-byte chunks, and simply stop when there's no more file left to read:
buffer_size = 1024

with open('file.txt', 'rb') as f:
    while (chunk := f.read(buffer_size)) != b'':
        # do things with the variable `chunk`, which holds up to 1024 bytes (the last chunk may be shorter)
        ...
Note that the != b'' part of the condition can be safely removed, as an empty bytes object evaluates to False when used as a boolean expression.
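As a sketch (Python 3.8+; process_chunk is a hypothetical stand-in for whatever you do with the data):
buffer_size = 1024

with open('file.txt', 'rb') as f:
    # An empty bytes object is falsy, so the loop ends cleanly at EOF.
    while chunk := f.read(buffer_size):
        process_chunk(chunk)  # hypothetical placeholder for your own handling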

How can I truncate an mp3 audio file by 30%?

I am trying to truncate an audio file by 30%: if the audio file was 4 minutes long, after truncating it, it should be around 72 seconds. I have written the code below to do it, but it only returns a 0-byte file. Please tell me where I went wrong.
def loadFile():
    with open('music.mp3', 'rb') as in_file:
        data = len(in_file.read())
        with open('output.mp3', 'wb') as out_file:
            ndata = newBytes(data)
            out_file.write(in_file.read()[:ndata])

def newBytes(bytes):
    newLength = (bytes/100) * 30
    return int(newLength)

loadFile()
You are trying to read your file a second time, which returns no data because the first read (inside len(in_file.read())) already consumed the whole file. Instead, read the whole file into a variable once, compute the length of that, and reuse the same variable when writing.
def newBytes(bytes):
    return (bytes * 70) // 100   # integer division, so the result can be used as a slice index

def loadFile():
    with open('music.mp3', 'rb') as in_file:
        data = in_file.read()
        with open('output.mp3', 'wb') as out_file:
            ndata = newBytes(len(data))
            out_file.write(data[:ndata])
Also, it is better to multiply first and then use integer division, so you avoid having to work with floating-point numbers altogether.
You cannot reliably truncate an MP3 file by byte size and expect it to be equivalently truncated in audio time length.
MP3 frames can change bitrate. While your method will sort of work, it won't be all that accurate. Additionally, you'll undoubtedly break frames, leaving glitches at the end of the file. You will also lose ID3v1 tags (if you still use them... better to use ID3v2 anyway).
Consider executing FFmpeg with -acodec copy instead. This will simply copy the bytes over while maintaining the integrity of the file, and ensuring a good clean cut where you want it to be.
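For illustration, a sketch of what that might look like from Python (this assumes ffmpeg is on your PATH and that you already know the duration you want to keep; the 72 seconds from the question is used here):
import subprocess

# Copy the first 72 seconds of audio without re-encoding (-acodec copy),
# letting ffmpeg cut on frame boundaries and rewrite the container cleanly.
subprocess.run(
    ["ffmpeg", "-i", "music.mp3", "-t", "72", "-acodec", "copy", "output.mp3"],
    check=True,
)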

Getting hash (digest) of a file in Python - reading whole file at once vs reading line by line

I need to get a hash (digest) of a file in Python.
Generally, when processing any file content, it is advised to process it gradually line by line due to memory concerns, yet the whole file needs to be read in order to obtain its digest.
Currently I'm obtaining the hash this way:
import hashlib

def get_hash(f_path, mode='md5'):
    h = hashlib.new(mode)
    with open(f_path, 'rb') as file:
        data = file.read()
        h.update(data)
        digest = h.hexdigest()
        return digest
Is there any other way to perform this in a more optimized or cleaner manner?
Is there any significant improvement in reading the file gradually line by line over reading the whole file at once, when the whole file must still be loaded to calculate the hash?
According to the documentation for hash.update(), you don't need to concern yourself with the block sizes of different hashing algorithms. I'd still test that a bit, but it seems to check out: 512 is the block size of MD5, and if you change the chunk size to anything else, the results are the same as reading it all in at once.
import hashlib

def get_hash(f_path, mode='md5'):
    h = hashlib.new(mode)
    with open(f_path, 'rb') as file:
        data = file.read()
        h.update(data)
        digest = h.hexdigest()
        return digest

def get_hash_memory_optimized(f_path, mode='md5'):
    h = hashlib.new(mode)
    with open(f_path, 'rb') as file:
        block = file.read(512)
        while block:
            h.update(block)
            block = file.read(512)
    return h.hexdigest()

digest = get_hash('large_bin_file')
print(digest)

digest = get_hash_memory_optimized('large_bin_file')
print(digest)
> bcf32baa9b05ca3573bf568964f34164
> bcf32baa9b05ca3573bf568964f34164
Of course you can load the data in chunks, so that memory usage drops significantly because you no longer have to load the whole file. Then you call hash.update(chunk) for each chunk:
import hashlib
from functools import partial

Hash = hashlib.new("sha1")
size = 128  # just an example

with open("data.txt", "rb") as f:
    for chunk in iter(partial(f.read, size), b''):
        Hash.update(chunk)
I find this iter trick very neat because it allows you to write much cleaner code. It may look confusing at first, so I'll explain how it works:
iter(function, sentinel) executes function successively and yields the values it returns until one of them is equal to sentinel.
partial(f.read, size) returns a callable version of f.read(size). This is oversimplified, but still correct in this case.
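A tiny, self-contained illustration of how the two pieces fit together (using an in-memory buffer rather than a real file, purely for demonstration):
import io
from functools import partial

buf = io.BytesIO(b"abcdefghij")

# partial(buf.read, 4) is a callable that reads up to 4 bytes per call;
# iter() keeps calling it until it returns the sentinel b'' (EOF).
for chunk in iter(partial(buf.read, 4), b''):
    print(chunk)   # b'abcd', then b'efgh', then b'ij'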
You get the same result with both snippets:
h = hashlib.new("md5")
with open(filename, "rb") as f:
    for line in f:
        h.update(line)
print(h.hexdigest())
and
h = hashlib.new("md5")
with open(filename, "rb") as f:
    h.update(f.read())
print(h.hexdigest())
A few notes:
the first approach works best with big text files, memory-wise. With a binary file, there's no such thing as a "line"; it will still work, but a "chunk" approach is more regular (not going to paraphrase the other answers)
the second approach eats a lot of memory if the file is big
in both cases, make sure you open the file in binary mode, or end-of-line conversion could lead to a wrong checksum (external tools would compute a different MD5 than your program)
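For completeness, a sketch of the chunked approach alluded to in the first note (Python 3.8+ for the walrus operator; filename as in the snippets above; the chunk size is arbitrary):
import hashlib

h = hashlib.new("md5")
with open(filename, "rb") as f:
    # Fixed-size chunks are the natural unit for binary data; the resulting
    # digest is identical to hashing the whole file at once.
    while chunk := f.read(65536):
        h.update(chunk)
print(h.hexdigest())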

Breaking a File into Blocks

Working on an assignment for a self-study course that I'm taking in cryptography (I'm receiving no credit for this class). I need to compute hash values on a large file where the hash is done block by block. The thing that I am stumped on at the moment is how to break up the file into these blocks. I'm using Python, which I'm very new to.
f = open('myfile', 'rb')
BLOCK_SIZE = 1024
m = Crypto.Hash.SHA256.new()
thisHash = ""
blocks = os.path.getsize('myfile') / BLOCK_SIZE  # ignore partial last block for now

for i in Range(blocks):
    b = f.read(BLOCK_SIZE)
    thisHash = m.update(b.encode())
    f.seek(block_size, os.SEEK_CUR)
Am I approaching this correctly? The code seems to run up until the m.update(b.encode()) line executes. I don't know if I am way off base or what to do to make this work. Any advice is appreciated. Thanks!
(note: as you might notice, this code doesn't really produce anything at the moment - I'm just getting some of the scaffolding set up)
You'll have to do a few things to make this example work correctly. Here are some points:
Crypto.Hash.SHA256.SHA256Hash.update() (you invoke it as m.update()) has no return value. To pull a human-readable hash out of the object, .update() it a bunch of times and then call .hexdigest()
You don't need to encode binary data before feeding it to the .update() function; encoding is for text. Just pass the block of bytes you read from the file.
File pointers are advanced by file.read(). You don't need a separate .seek() operation.
.read() will return an empty string if you've hit EOF already. This is totally fine. Feel free just to pull in that partial block.
Variable names are case-sensitive. block_size is not the same variable as BLOCK_SIZE.
Making these few minor adjustments, and assuming you have all the right imports, you'll be on the right track.
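Putting those points together, a corrected version might look something like this (my sketch, not the poster's code; it assumes a PyCryptodome-style from Crypto.Hash import SHA256):
from Crypto.Hash import SHA256

BLOCK_SIZE = 1024

m = SHA256.new()
with open('myfile', 'rb') as f:
    while True:
        b = f.read(BLOCK_SIZE)   # advances the file position; no seek() needed
        if not b:                # empty bytes means EOF (a partial last block is fine)
            break
        m.update(b)              # no .encode(); b is already bytes

print(m.hexdigest())             # human-readable digest after all the updates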
An alternative solution would be to break the file into blocks first and then perform the hash block by block.
This will break the file into chunks of 1024 bytes:
fList = []
with open(file, 'rb') as f:
    while True:
        chunk = f.read(1024)
        if chunk:
            fList.append(chunk)
        else:
            numBlocks = len(fList)
            break
Note: last block size may be less than 1024 bytes
Now you can do the hash in whichever way you want.
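For example, a minimal sketch of hashing the collected blocks afterwards (using hashlib's SHA-256 here; any incremental hash API works the same way):
import hashlib

h = hashlib.sha256()
for chunk in fList:      # feed each 1024-byte block in order
    h.update(chunk)
print(h.hexdigest())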
