How can I load a file with buffers in Python?

Hope you are having a great day!
In my recent ventures with Python 3.8.5 I have come across a dilemma, I must say...
Being a fairly new programmer, I'm afraid I don't have the technical knowledge to load a single (BIG) file into my program.
To make my question more understandable, let's look at the example below:
Let's say there is a file on my system called "File.mp4" or "File.txt" (1 GB in size);
I want to load this file into my program, opening it in 'rb' mode;
I declared a buffer size of 1024;
This is the part I don't know how to solve:
I load 1024 bytes into the program;
I do whatever I need to do with them;
I then load another 1024 bytes in place of the old buffer;
Rinse and repeat until the whole file has been run through.
I looked at this question, but either it doesn't fit my case or I just don't know how to implement it -> link to the question
This is the whole code you requested:
BUFFER = 1024
with open('file.txt', 'rb') as f:
    while (chunk := f.read(BUFFER)) != '':
        print(list(chunk))

You can use buffered input from io with bytearray:
import io

buf = bytearray(1024)
with io.open(filename, 'rb') as fp:
    while True:
        size = fp.readinto(buf)
        if not size:
            break
        # do things with buf, considering the first `size` bytes

This is one of the situations that Python 3.8's new walrus operator - which both assigns a value to a variable and returns the value it just assigned - is really good for. You can use file.read(size) to read in 1024-byte chunks and simply stop when there's no more file left to read:
buffer_size = 1024
with open('file.txt', 'rb') as f:
    while (chunk := f.read(buffer_size)) != b'':
        # do things with `chunk`; every chunk but possibly the last has len(chunk) == buffer_size
        ...
Note that the != b'' part of the condition can be safely removed, as an empty bytes object evaluates to False when used as a boolean expression.
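For completeness, a minimal sketch of that simplified form; printing each chunk's length stands in for real processing:
buffer_size = 1024
with open('file.txt', 'rb') as f:
    while chunk := f.read(buffer_size):
        # an empty read (b'') is falsy, so the loop stops at EOF;
        # the final chunk may be shorter than buffer_size
        print(len(chunk))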

Related

Can someone please explain how this function works

I found this function when looking up how to count lines in a file, but I have no idea how it works.
def _count_generator(reader):
    b = reader(1024 * 1024)
    while b:
        yield b
        b = reader(1024 * 1024)

with open('test.txt', 'rb') as fp:
    c_generator = _count_generator(fp.raw.read)
    # count each new line
    count = sum(buffer.count(b'\n') for buffer in c_generator)
    print('total lines', count + 1)
I understand that it's reading the file as a bytes object, but I don't understand what reader(1024 * 1024) does or how exactly the whole thing works.
Any help is appreciated.
Thanks.
open() returns a file object. Since it's opening the file with rb (read binary), it returns an io.BufferedReader. The underlying raw buffer can be retrieved via the .raw property, which is a RawIOBase - its method, RawIOBase.read, is what gets passed to _count_generator.
Since _count_generator is a generator, it is an iterable. Its purpose is to read 1 MB of data from the file and yield it back to the caller on every iteration until the file is over - when the file is exhausted, reader() returns an empty bytes object, which stops the loop.
The caller consumes that 1 MB of data and counts the newlines in it via the sum function, over and over again, until the file is exhausted.
tl;dr You are reading the file 1 MB at a time and summing its newlines. Why? Because more likely than not you cannot read the entire file at once - it's too large to fit in memory.
Let's start with the argument to the function. fp.raw.read is the read method of the raw reader underlying the binary file object fp. The read method accepts an integer telling it how many bytes to read at most, and it returns an empty bytes object at EOF.
The function itself is a generator. It lazily calls read to get up to 1 MB of data at a time. The chunks are not read until the generator is consumed by sum, which counts the newlines. A raw read with a positive integer argument makes at most one call to the underlying OS, so 1 MB is just an upper bound here: most of the time it will return one disk block, usually around 4 KB or so.
This program has two immediately apparent flaws if you take the time to read the documentation:
raw is not guaranteed to exist in every implementation of Python:
This is not part of the BufferedIOBase API and may not exist on some implementations.
read in non-blocking mode can return None when no data is available even though EOF has not been reached. Only an empty bytes object indicates EOF, so the while loop should really be while b != b''.
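A hedged sketch of how the line counter might be written to sidestep both issues - it uses the buffered file object's own read (no .raw) and treats only an empty bytes object as EOF:
def _count_chunks(reader):
    b = reader(1024 * 1024)
    while b != b'':                      # only b'' signals EOF
        yield b
        b = reader(1024 * 1024)

with open('test.txt', 'rb') as fp:
    # fp.read is the BufferedReader's own method, so .raw is not needed
    count = sum(buf.count(b'\n') for buf in _count_chunks(fp.read))
    print('total lines', count + 1)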

Getting hash (digest) of a file in Python - reading whole file at once vs reading line by line

I need to get a hash (digest) of a file in Python.
Generally, when processing any file content it is advised to process it gradually, line by line, due to memory concerns, yet I need the whole file to be loaded in order to obtain its digest.
Currently I'm obtaining the hash this way:
import hashlib

def get_hash(f_path, mode='md5'):
    h = hashlib.new(mode)
    with open(f_path, 'rb') as file:
        data = file.read()
        h.update(data)
        digest = h.hexdigest()
    return digest
Is there any other way to perform this in a more optimized or cleaner manner?
Is there any significant improvement in reading the file gradually, line by line, over reading the whole file at once, when the whole file still has to be fed into the hash?
According to the documentation for hashlib.update(), you don't need to concern yourself with the block size of different hashing algorithms. However, I'd test that a bit. But it seems to check out: 512 is the block size of MD5 (in bits), and if you change the chunk size to anything else, the result is the same as reading it all in at once.
import hashlib
def get_hash(f_path, mode='md5'):
    h = hashlib.new(mode)
    with open(f_path, 'rb') as file:
        data = file.read()
        h.update(data)
        digest = h.hexdigest()
    return digest

def get_hash_memory_optimized(f_path, mode='md5'):
    h = hashlib.new(mode)
    with open(f_path, 'rb') as file:
        block = file.read(512)
        while block:
            h.update(block)
            block = file.read(512)
    return h.hexdigest()

digest = get_hash('large_bin_file')
print(digest)

digest = get_hash_memory_optimized('large_bin_file')
print(digest)
> bcf32baa9b05ca3573bf568964f34164
> bcf32baa9b05ca3573bf568964f34164
Of course you can load the data in chunks, so that memory usage drops significantly because you no longer have to hold the whole file in memory. You then call hash.update(chunk) for each chunk:
import hashlib
from functools import partial

Hash = hashlib.new("sha1")
size = 128  # just an example

with open("data.txt", "rb") as f:
    for chunk in iter(partial(f.read, size), b''):
        Hash.update(chunk)
I find this iter trick very neat because it allows you to write much cleaner code. It may look confusing at first, so I'll explain how it works:
iter(function, sentinel) executes function successively and yields the values it returns until one of them is equal to sentinel.
partial(f.read, size) returns a callable version of f.read(size). This is oversimplified, but still correct in this case.
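If the pattern is still unclear, here is a hypothetical toy example of iter(function, sentinel) that has nothing to do with files - it keeps calling a zero-argument callable built with partial until the sentinel value comes back:
from functools import partial
import random

roll = partial(random.randint, 1, 6)   # a callable version of random.randint(1, 6)
for value in iter(roll, 6):            # calls roll() repeatedly, stops once it returns 6
    print(value)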
You get the same result with both snippets:
h = hashlib.new("md5")
with open(filename,"rb") as f:
for line in f:
h.update(line)
print(h.hexdigest())
and
h = hashlib.new("md5")
with open(filename,"rb") as f:
h.update(f.read())
print(h.hexdigest())
A few notes:
the first approach works best with big text files, memory-wise. With a binary file, there is no such thing as a "line". It will still work, but a "chunk" approach is more regular (not going to paraphrase the other answers)
the second approach eats a lot of memory if the file is big
in both cases, make sure you open the file in binary mode, or end-of-line conversion could lead to a wrong checksum (external tools would compute a different MD5 than your program)

Reading RestResponse in Chunks

To avoid MemoryErrors in Python, I am trying to read the file in chunks. I've been searching for half a day on how to read chunks from a RESTResponse, but to no avail.
The source is a file-like object using the Dropbox SDK for Python.
Here's my attempt:
import dropbox
from filechunkio import FileChunkIO
import math

file_and_metadata = dropbox_client.metadata(path)
hq_file = dropbox_client.get_file(file_and_metadata['path'])

source_size = file_and_metadata['bytes']
chunk_size = 4194304
chunk_count = int(math.ceil(source_size / chunk_size))

for i in range(chunk_count + 1):
    offset = chunk_size * i
    bytes = min(chunk_size, source_size - offset)
    with FileChunkIO(hq_file, 'r', offset=offset,
                     bytes=bytes) as fp:
        with open('tmp/testtest123.mp4', 'wb') as f:
            f.write(fp)
            f.flush()
This results in "TypeError: coercing to Unicode: need string or buffer, RESTResponse found"
Any clues or solutions would be greatly appreciated.
Without knowing anything about FileChunkIO, or even knowing where your code is raising an exception, it's hard to be sure, but my guess is that it needs a real file-like object. Or maybe it does something silly, like checking the type so it can decide whether you're looking to chunk up a string or chunk up a file.
Anyway, according to the docs, RESTResponse isn't a full file-like object, but it implements read and close. And you can easily chunk something that implements read without any fancy wrappers. File-like objects' read methods are guaranteed to return b'' when you get to EOF, and can return fewer bytes than you asked for, so you don't need to guess how many times you need to read and do a short read at the end. Just do this:
chunk_size = 4194304
with open('tmp/testtest123.mp4', 'wb') as f:
    while True:
        buf = hq_file.read(chunk_size)
        if not buf:
            break
        f.write(buf)
(Notice that I moved the open outside of the loop. Otherwise, for each chunk, you're going to open and empty out the file, then write the next chunk, so at the end you'll end up with just the last one.)
If you want a chunking wrapper, there's a perfectly good builtin function, iter, that can do it for you:
chunk_size = 4194304
chunks = iter(lambda: hq_file.read(chunk_size), '')
with open('tmp/testtest123.mp4', 'wb') as f:
    f.writelines(chunks)
Note that the exact same code works in Python 3.x if you change that '' to b'', but that breaks Python 2.5.
This might be a bit of an abuse of writelines, because we're writing an iterable of strings that aren't actually lines. If you don't like it, an explicit loop is just as simple and not much less concise.
I usually write that as partial(hq_file.read, chunk_size) rather than lambda: hq_file.read(chunk_size), but it's really a matter of preference; read the docs on partial and you should be able to understand why they ultimately have the same effect, and decide which one you prefer.
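For reference, a hedged sketch of that same snippet written with partial, assuming the hq_file object from above and Python 3 (hence the b'' sentinel):
from functools import partial

chunk_size = 4194304
chunks = iter(partial(hq_file.read, chunk_size), b'')
with open('tmp/testtest123.mp4', 'wb') as f:
    f.writelines(chunks)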

Breaking a File into Blocks

I'm working on an assignment for a self-study course in cryptography (I'm receiving no credit for this class). I need to compute hash values on a large file where the hash is done block by block. The thing I'm stumped on at the moment is how to break the file up into these blocks. I'm using Python, which I'm very new to.
f = open('myfile', 'rb')
BLOCK_SIZE = 1024
m = Crypto.Hash.SHA256.new()
thisHash = ""
blocks = os.path.getsize('myfile') / BLOCK_SIZE  # ignore partial last block for now

for i in Range(blocks):
    b = f.read(BLOCK_SIZE)
    thisHash = m.update(b.encode())
    f.seek(block_size, os.SEEK_CUR)
Am I approaching this correctly? The code seems to run up until the m.update(b.encode()) line executes. I don't know if I am way off base or what to do to make this work. Any advice is appreciated. Thanks!
(note: as you might notice, this code doesn't really produce anything at the moment - I'm just getting some of the scaffolding set up)
You'll have to do a few things to make this example work correctly. Here are some points:
Crypto.Hash.SHA256.SHA256Hash.update() (you invoke it as m.update()) has no return value. To pull a human-readable hash out of the object, .update() it a bunch of times and then call .hexdigest().
You don't need to encode binary data before feeding it to .update(). Just pass the data block as read from the file.
File pointers are advanced by file.read(). You don't need a separate .seek() operation.
.read() will return an empty string if you've hit EOF already. This is totally fine. Feel free just to pull in that partial block.
Variable names are case-sensitive. block_size is not the same variable as BLOCK_SIZE.
Making these few minor adjustments, and assuming you have all the right imports, you'll be on the right track.
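Putting those points together, a sketch of what the corrected loop might look like, assuming PyCrypto's SHA256 module as in the question:
import os
from Crypto.Hash import SHA256

BLOCK_SIZE = 1024
m = SHA256.new()
with open('myfile', 'rb') as f:
    while True:
        b = f.read(BLOCK_SIZE)   # read() advances the file pointer; no seek needed
        if not b:                # an empty read means EOF; a partial last block is fine
            break
        m.update(b)              # pass the raw data; no .encode() needed
print(m.hexdigest())             # human-readable digest after all the updates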
An alternative solution would be to break the file into blocks first and then perform the hash block by block.
This will break the file into chunks of 1024 bytes:
fList = []
with open(file, 'rb') as f:
    while True:
        chunk = f.read(1024)
        if chunk:
            fList.append(chunk)
        else:
            numBlocks = len(fList)
            break
Note: the last block may be smaller than 1024 bytes.
Now you can compute the hash in whichever way you want.
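For example, a minimal sketch that feeds the collected blocks into a single running SHA-256 digest (using hashlib here rather than PyCrypto; adjust to whatever your assignment actually requires):
import hashlib

h = hashlib.sha256()
for block in fList:      # fList holds the 1024-byte chunks collected above
    h.update(block)
print(h.hexdigest())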

How can I change a huge file into csv in python

I'm a beginner in Python. I have a huge text file (hundreds of GB) and I want to convert it into a csv file. In my text file, I know the row delimiter is the string "<><><><><><><>". If a line contains that string, I want to replace it with ". Is there a way to do it without having to read the old file and rewrite a new one?
Normally I thought I would need to do something like this:
fin = open("input", "r")
fout = open("outpout", "w")
line = f.readline
while line != "":
if line.contains("<><><><><><><>"):
fout.writeline("\"")
else:
fout.writeline(line)
line = f.readline
but copying hundreds of GB seems wasteful. Also, I don't know if open will eat lots of memory (does it treat the file handle as a stream?).
Any help is greatly appreciated.
Note: an example of the file would be
file.txt
<><><><><><><>
abcdefeghsduai
asdjliwa
1231214 ""
<><><><><><><>
would be one row and one column in csv.
@richard-levasseur
I agree, sed seems like the right way to go. Here's a rough cut at what the OP describes:
sed -i -e 's/<><><><><><><>/"/g' foo.txt
This will do the replacement in-place in the existing foo.txt. For that reason, I recommend having the original file under some sort of version control; any of the DVCS should fit the bill.
Yes, open() treats the file as a stream, as does readline(). It'll only read the next line. If you call read(), however, it'll read everything into memory.
Your example code looks OK at first glance. Almost every solution will require you to copy the file elsewhere; it's not exactly easy to modify the contents of a file in place without a 1:1 replacement.
It may be faster to use some standard unix utilities (awk and sed most likely), but I lack the unix and bash-fu necessary to provide a full solution.
It's only wasteful if you don't have disk space to spare. That is, fix it when it becomes a problem. Your solution looks fine as a first attempt.
It's not wasteful of memory, because a file handle is a stream.
Reading lines is simply done using a file iterator:
for line in fin:
    if "<><><><><><><>" in line:
        fout.write('"\n')
Also consider the csv writer object for writing CSV files, e.g.:
import csv
writer = csv.writer(open("some.csv", "wb"))
writer.writerows(someiterable)
With Python you will have to create a new file for safety's sake; it will cause a lot fewer headaches than trying to write in place.
The code below reads your input one line at a time and buffers the columns (from what I understood, your test input file was a single row), then once the end-of-row delimiter is hit it writes that buffer to disk, flushing manually every 1000 lines of the original file. This will save some IO as well, compared to writing every segment: 1000 writes of 32 bytes each will be faster than 4000 writes of 8 bytes.
fin = open(input_fn, "rb")
fout = open(output_fn, "wb")
row_delim = "<><><><><><><>"
write_buffer = []
for i, line in enumerate(fin):
if not i % 1000:
fout.flush()
if row_delim in line and i:
fout.write('"%s"\r\n'%'","'.join(write_buffer))
write_buffer = []
else:
write_buffer.append(line.strip())
Hope that helps.
EDIT: Forgot to mention: while using .readline() is not a bad thing, don't use .readlines(), which reads the entire content of the file into a list containing each line - that is incredibly inefficient. Using the built-in iterator that comes with a file object gives the best memory usage and speed.
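To make the difference concrete, a small hypothetical comparison (the file name is made up):
with open('big_input.txt', 'r') as fin:
    all_lines = fin.readlines()   # builds a list of every line at once - memory-hungry

with open('big_input.txt', 'r') as fin:
    for line in fin:              # the file iterator keeps only one line in memory
        pass                      # process the line here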
@Constatin suggests that if you would be satisfied with replacing '<><><><><><><>\n' with a '"' padded with spaces to the same length and followed by '\n', then the replacement string has the same length as the original, and in that case you can craft a solution for in-place editing with mmap. You will need Python 2.6. It's vital that the file is opened in the right mode!
import mmap, os

CHUNK = 2**20

oldStr = '<><><><><><><>\n'
newStr = '"' + ' ' * 13 + '\n'   # padded with spaces to the same length as oldStr
strLen = len(oldStr)
assert strLen == len(newStr)

f = open("myfilename", "r+")
size = os.fstat(f.fileno()).st_size

for offset in range(0, size, CHUNK):
    map = mmap.mmap(f.fileno(),
                    length=min(CHUNK + strLen, size - offset),  # not beyond EOF
                    offset=offset)
    index = 0  # start at beginning
    while 1:
        index = map.find(oldStr, index)  # find next match
        if index == -1:  # no more matches in this map
            break
        map[index:index + strLen] = newStr

f.close()
This code is not debugged! It works for me on a 3 MB test case, but it may not work on a large ( > 2GB) file - the mmap module still seems a bit immature, so I wouldn't rely on it too much.
Looking at the bigger picture, from what you've posted it isn't clear that your file will end up as valid CSV. Also be aware that the tool you're planning to use to actually process the CSV may be flexible enough to deal with the file as it stands.
If you're delimiting fields with double quotes, it looks like you need to escape the double quotes you have occurring in your elements (for example 1231214 "" will need to be \n1231214 \"\").
Something like
fin = open("input", "r")
fout = open("output", "w")
for line in fin:
if line.contains("<><><><><><><>"):
fout.writeline("\"")
else:
fout.writeline(line.replace('"',r'\"')
fin.close()
fout.close()
[For the problem exactly as stated] There's no way this can be done without copying the data, in Python or any other language. If your processing always replaced substrings with new substrings of equal length, maybe you could do it in place. But whenever you replace <><><><><><><> with ", you change the position of all subsequent characters in the file. Copying from one place to another is the only way to handle this.
EDIT:
Note that the use of sed won't actually save any copying... sed doesn't really edit in place either. From the GNU sed manual:
-i[SUFFIX]
--in-place[=SUFFIX]
This option specifies that files are to be edited in-place. GNU sed does this by creating a temporary file and sending output to this file rather than to the standard output.
(emphasis mine.)
