Can someone please explain how this function works - python

I found this function when looking up how to count lines in a file,
but I have no idea how it works.
def _count_generator(reader):
    b = reader(1024 * 1024)
    while b:
        yield b
        b = reader(1024 * 1024)

with open('test.txt', 'rb') as fp:
    c_generator = _count_generator(fp.raw.read)
    # count each new line
    count = sum(buffer.count(b'\n') for buffer in c_generator)
    print('total lines', count + 1)
I understand that it's reading the file as a bytes object, but I don't understand
what the reader(1024 * 1024) call does or how exactly the whole thing works.
Any help is appreciated.
Thanks.

open() returns a file object. Since the file is opened with rb (read binary), it returns an io.BufferedReader. The underlying raw stream can be retrieved via the .raw property, which is a RawIOBase; its read method, RawIOBase.read, is what gets passed to _count_generator.
Since _count_generator is a generator, it is iterable. Its purpose is to read up to 1 MB of data from the file and yield it back to the caller on each iteration until the file is exhausted: at that point reader() returns an empty bytes object, which is falsy and stops the loop.
The caller counts the newlines in each 1 MB chunk, and sum adds those counts up, chunk after chunk, until the file is exhausted.
tl;dr You are reading the file 1 MB at a time and summing its newlines. Why? Because, more likely than not, the file is too large to fit in memory all at once.
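To make the control flow concrete, the sum(...) line is just a compact spelling of an explicit loop over the chunks. Here is a sketch that reuses the same file name and helper as the question (mine, for illustration only):

def _count_generator(reader):
    b = reader(1024 * 1024)        # ask for up to 1 MB of raw bytes
    while b:                       # an empty bytes object (EOF) ends the loop
        yield b
        b = reader(1024 * 1024)

count = 0
with open('test.txt', 'rb') as fp:
    for buffer in _count_generator(fp.raw.read):
        count += buffer.count(b'\n')   # newlines in this chunk only
print('total lines', count + 1)        # +1 assumes the last line has no trailing newline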

Let's start with the argument to the function. fp.raw.read is the read method of the raw reader underlying the binary file object fp. That read method accepts an integer telling it how many bytes to read, and it returns an empty bytes object at EOF.
The function itself is a generator. It lazily calls read to get up to 1 MB of data at a time; the chunks are not read until the generator expression inside sum requests them, and sum counts the newlines in each chunk. A raw read with a positive size only ever makes one call to the underlying OS, so 1 MB is just an upper bound in this case: most of the time it will return one disk block, usually around 4 KB or so.
This program has two immediately apparent flaws if you take the time to read the documentation.
raw is not guaranteed to exist in every implementation of Python:
This is not part of the BufferedIOBase API and may not exist on some implementations.
read in non-blocking mode can return None when no data is available but EOF has not been reached. Only empty bytes indicates EOF, so the while loop should be while b != b'':.
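For what it's worth, here is a sketch of a variant that sidesteps both issues: it calls read on the buffered file object itself (which is part of the documented API, unlike .raw) and treats only an empty bytes object as EOF. This is my code, not the original poster's:

def _count_chunks(read):
    """Yield chunks from `read` until it returns an empty bytes object."""
    while True:
        b = read(1024 * 1024)
        if b is None:        # non-blocking raw read with no data available yet
            continue
        if b == b'':         # only empty bytes signals EOF
            return
        yield b

with open('test.txt', 'rb') as fp:
    count = sum(chunk.count(b'\n') for chunk in _count_chunks(fp.read))
print('total lines', count + 1)

For a regular file opened with open(), read never returns None; the None branch only matters if you pass in a raw, non-blocking reader.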

Related

How can I load a file with buffers in python?

hope you are having a great day!
In my recent ventures with Python 3.8.5 I have come across a dilemma I must say...
Being that I am a fairly new programmer I am afraid that I don't have the technical knowledge to load a single (BIG) file into the program.
To make my question much more understandable, let's look at this down below:
Let's say that there is a file on my system called "File.mp4" or "File.txt" (1 GB in size);
I want to load this file into my program using the open function in rb mode;
I declared a buffer size of 1024;
This is the part I don't know how to solve:
I load 1024 bytes into the program
I do whatever I need to do with them
I then load another 1024 bytes in the place of the old buffer
Rinse and repeat until the whole file has been run through.
I looked at this question but either it is not good for my case or I just don't know how to implement it -> link to the question
This is the whole code you requested:
BUFFER = 1024

with open('file.txt', 'rb') as f:
    while (chunk := f.read(BUFFER)) != '':
        print(list(chunk))
You can use buffered input from io with bytearray:
import io

buf = bytearray(1024)
with io.open(filename, 'rb') as fp:
    while True:
        size = fp.readinto(buf)
        if not size:
            break
        # do things with buf, considering the size
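One way to act on that "considering the size" comment is to slice a memoryview of the buffer, so the final short read doesn't hand you stale bytes left over from a previous chunk. A sketch, with 'file.txt' as a placeholder name:

import io

buf = bytearray(1024)
view = memoryview(buf)                  # lets us slice the buffer without copying
with io.open('file.txt', 'rb') as fp:
    while True:
        size = fp.readinto(buf)
        if not size:
            break
        chunk = view[:size]             # only the first `size` bytes are valid data
        print(len(chunk))               # placeholder for real work on the chunk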
This is one of the situations that Python 3.8's new walrus operator - which both assigns a value to a variable and returns the value it just assigned - is really good for. You can use file.read(size) to read the file in 1024-byte chunks, and simply stop when there's no more file left to read:
buffer_size = 1024

with open('file.txt', 'rb') as f:
    while (chunk := f.read(buffer_size)) != b'':
        ...  # do things with `chunk`; len(chunk) == buffer_size except possibly for the last chunk
Note that the != b'' part of the condition can be safely removed, as an empty bytes object evaluates to False when used as a boolean expression.
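For completeness, a short sketch of that shortened form (my variant of the snippet above, same placeholder file name):

buffer_size = 1024

with open('file.txt', 'rb') as f:
    while chunk := f.read(buffer_size):   # empty bytes is falsy, so the loop ends at EOF
        print(len(chunk))                 # placeholder for real work on `chunk`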

read a large file without memory reallocation

When I want to read a binary file in memory in python I just do:
with open("file.bin", "rb") as f:
    contents = f.read()
With "reasonable"-size files it's perfect, but when the files are huge (say, 1 GB or more), monitoring the process shows memory growing, then shrinking, then growing again... probably the effect of realloc behind the scenes, when the original chunk of memory becomes too small to hold the file.
Done several times, this realloc + memmove operation takes a lot of CPU time. In C, I wouldn't have the problem because I would pass a properly allocated buffer to fread for instance (but here I can't because bytes objects are immutable, so I cannot pre-allocate).
Of course I could read it chunk by chunk like this:
chunks = []
with open("file.bin", "rb") as f:
    while True:
        contents = f.read(CHUNK_SIZE)
        if contents:
            chunks.append(contents)
        else:
            break
but then I would have to join the byte chunks, which would also take twice the needed memory at some point, and I may not be able to afford that.
Is there a method to read a big binary file in a buffer with one sole big memory allocation, and efficiently CPU-wise?
You can use the os.open method, which is basically a wrapper around the POSIX syscall open.
import os

fd = os.open("file.bin", os.O_RDONLY)  # os.O_BINARY exists only on Windows; add it there
This opens the file in rb mode.
os.open returns a file descriptor which does not have read methods. You'll have to read n bytes at a time:
data = os.read(fd, 100)
Once done, use os.close to close the file:
os.close(fd)
You're reading a file in Python just like you'd do it in C!
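Putting those pieces together, a minimal sketch of the whole read loop might look like this (CHUNK is my placeholder name, not from the answer):

import os

CHUNK = 1024 * 1024                       # placeholder chunk size

fd = os.open("file.bin", os.O_RDONLY)     # add os.O_BINARY on Windows
try:
    while True:
        data = os.read(fd, CHUNK)
        if not data:                      # empty bytes means EOF
            break
        # ... do something with `data` here
finally:
    os.close(fd)

Note that each os.read still allocates a new bytes object per chunk, so this alone does not give you the single preallocated buffer you would pass to fread in C; for that you would still want something like readinto on a preallocated bytearray.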
Here's a couple of useful references:
Official docs
Library Reference
Disclaimer: Based on my knowledge of how C's open function works, I believe this should do the trick.

Reading RestResponse in Chunks

To avoid MemoryErrors in Python, I am trying to read in chunks. I've been searching for half a day on how to read chunks from a RESTResponse, but to no avail.
The source is a file-like object using the Dropbox SDK for python.
Here's my attempt:
import dropbox
from filechunkio import FileChunkIO
import math

file_and_metadata = dropbox_client.metadata(path)
hq_file = dropbox_client.get_file(file_and_metadata['path'])
source_size = file_and_metadata['bytes']
chunk_size = 4194304
chunk_count = int(math.ceil(source_size / chunk_size))
for i in range(chunk_count + 1):
    offset = chunk_size * i
    bytes = min(chunk_size, source_size - offset)
    with FileChunkIO(hq_file, 'r', offset=offset,
                     bytes=bytes) as fp:
        with open('tmp/testtest123.mp4', 'wb') as f:
            f.write(fp)
            f.flush()
This results in "TypeError: coercing to Unicode: need string or buffer, RESTResponse found"
Any clues or solutions would be greatly appreciated.
Without knowing anything about FileChunkIO, or even knowing where your code is raising an exception, it's hard to be sure, but my guess is that it needs a real file-like object. Or maybe it does something silly, like checking the type so it can decide whether you're looking to chunk up a string or chunk up a file.
Anyway, according to the docs, RESTResponse isn't a full file-like object, but it implements read and close. And you can easily chunk something that implements read without any fancy wrappers. File-like objects' read methods are guaranteed to return b'' when you get to EOF, and can return fewer bytes than you asked for, so you don't need to guess how many times you need to read and do a short read at the end. Just do this:
chunk_size = 4194304
with open('tmp/testtest123.mp4', 'wb') as f:
    while True:
        buf = hq_file.read(chunk_size)
        if not buf:
            break
        f.write(buf)
(Notice that I moved the open outside of the loop. Otherwise, for each chunk, you're going to open and empty out the file, then write the next chunk, so at the end you'll end up with just the last one.)
If you want a chunking wrapper, there's a perfectly good builtin function, iter, that can do it for you:
chunk_size = 4194304
chunks = iter(lambda: hq_file.read(chunk_size), '')
with open('tmp/testtest123.mp4', 'wb') as f:
    f.writelines(chunks)
Note that the exact same code works in Python 3.x if you change that '' to b'', but that breaks Python 2.5.
This might be a bit of an abuse of writelines, because we're writing an iterable of strings that aren't actually lines. If you don't like it, an explicit loop is just as simple and not much less concise.
I usually write that as partial(hq_file.read, chunk_size) rather than lambda: hq_file.read(chunk_size), but it's really a matter of preference; read the docs on functools.partial and you should be able to understand why they ultimately have the same effect, and decide which one you prefer.
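For reference, a sketch of that partial spelling of the same wrapper, written for Python 3 (hence the b'' sentinel); hq_file is still the RESTResponse from the question:

from functools import partial

chunk_size = 4194304
chunks = iter(partial(hq_file.read, chunk_size), b'')
with open('tmp/testtest123.mp4', 'wb') as f:
    f.writelines(chunks)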

Python Generator memory benefits for large readins?

I'm wondering about the memory benefits of python generators in this use case (if any). I wish to read in a large text file that must be shared between all objects. Because it only needs to be used once and the program finishes once the list is exhausted I was planning on using generators.
The "saved state" of a generator, I believe, lets it keep track of the next value to be passed to whatever object is calling it. I've read that generators also save memory by not returning all their values at once, but rather calculating them on the fly. I'm a little confused about whether I'd get any benefit in this use case, though.
Example Code:
def bufferedFetch():
    while True:
        buffer = open("bigfile.txt", "r").read().split('\n')
        for i in buffer:
            yield i
Considering that the buffer is going to be reading in the entire "bigfile.txt" anyway, wouldn't this be stored within the generator, for no memory benefit? Is there a better way to return the next value of a list that can be shared between all objects?
Thanks.
In this case no. You are reading the entire file into memory by doing .read().
What you ideally want to do instead is:
def bufferedFetch():
    with open("bigfile.txt", "r") as f:
        for line in f:
            yield line
The Python file object takes care of line endings for you (which are system-dependent), and its built-in iterator yields lines one at a time as you loop over it, without reading the entire file into memory.
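A small usage sketch (my example, not part of the answer), just to show that the caller pulls one line at a time, so memory stays flat even for a huge file:

for line in bufferedFetch():
    if 'ERROR' in line:          # placeholder per-line work
        print(line.rstrip('\n'))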

Handling big text files in Python

The basics are that I need to process 4 GB text files on a per-line basis.
Using .readline() or for line in f is great for memory, but the I/O takes ages. I'd like to use something like yield, but that (I think) will chop lines.
POSSIBLE ANSWER:
file.readlines([sizehint])
Read until EOF using readline() and return a list containing the lines thus read. If the optional sizehint argument is present, instead of reading up to EOF, whole lines totalling approximately sizehint bytes (possibly after rounding up to an internal buffer size) are read. Objects implementing a file-like interface may choose to ignore sizehint if it cannot be implemented, or cannot be implemented efficiently.
Didn't realize you could do this!
You can just iterate over the file object:
with open("filename") as f:
    for line in f:
        whatever
This will do some internal buffering to improve the performance. (Note that file.readline() will perform considerably worse because it does not buffer -- that's why you can't mix iteration over a file object with file.readline().)
If you want to do something on a per-line basis you can just loop over the file object:
f = open("w00t.txt")
for line in f:
    ...  # do stuff
However, doing the work on a per-line basis can be an actual performance bottleneck, so perhaps you should use a bigger chunk size? What you can do is, for example, read 4096 bytes, find the last line ending \n in that chunk, process everything up to it, and prepend what is left over to the next chunk.
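A sketch of that chunk-and-carry idea (my code, not the answerer's): read fixed-size blocks, process only complete lines, and carry the trailing partial line over to the next block. The process function and the 'w00t.txt' name are placeholders.

def process(line):
    ...                               # placeholder per-line work

with open('w00t.txt', 'rb') as f:
    leftover = b''
    while True:
        block = f.read(4096)
        if not block:
            if leftover:
                process(leftover)     # last line without a trailing newline
            break
        block = leftover + block
        lines = block.split(b'\n')
        leftover = lines.pop()        # incomplete tail, kept for the next block
        for line in lines:
            process(line)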
You could always chunk the lines up? I mean, why open one file and iterate all the way through it when you can open the same file six times and iterate through it in parallel?
e.g.
a #is the first 1024 bytes
b #is the next 1024
#etcetc
f #is the last 1024 bytes
Each file handle running in a separate process and we start to cook on gas. Just remember to deal with line endings properly.
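Here is a sketch of how that "one handle per process" idea could look, using multiprocessing and byte offsets. The splitting logic (seek just before each offset and skip forward to the next full line) is my own handling of the "deal with line endings properly" caveat, and FILENAME, NUM_WORKERS and handle_line are placeholders, not names from the answer.

import os
from multiprocessing import Pool

FILENAME = "w00t.txt"   # placeholder file name
NUM_WORKERS = 6

def handle_line(line):
    ...                  # placeholder per-line work

def process_range(bounds):
    start, end = bounds
    with open(FILENAME, "rb") as f:
        if start:
            f.seek(start - 1)
            f.readline()                 # skip forward to the start of the next full line
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            handle_line(line)

if __name__ == "__main__":
    size = os.path.getsize(FILENAME)
    step = size // NUM_WORKERS
    bounds = [(i * step, size if i == NUM_WORKERS - 1 else (i + 1) * step)
              for i in range(NUM_WORKERS)]
    with Pool(NUM_WORKERS) as pool:
        pool.map(process_range, bounds)

Each worker owns the lines that start inside its byte range, so lines crossing a boundary are processed exactly once.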
