Breaking a File into Blocks - python

I'm working on an assignment for a self-study course in cryptography (I'm receiving no credit for this class). I need to compute hash values on a large file, where the hash is done block by block. What I'm stumped on at the moment is how to break the file up into these blocks. I'm using Python, which I'm very new to.
f = open('myfile', 'rb')
BLOCK_SIZE = 1024
m = Crypto.Hash.SHA256.new()
thisHash = ""
blocks = os.path.getsize('myfile') / BLOCK_SIZE  # ignore partial last block for now
for i in Range(blocks):
    b = f.read(BLOCK_SIZE)
    thisHash = m.update(b.encode())
    f.seek(block_size, os.SEEK_CUR)
Am I approaching this correctly? The code seems to run up until the m.update(b.encode()) line executes. I don't know if I am way off base or what to do to make this work. Any advice is appreciated. Thanks!
(note: as you might notice, this code doesn't really produce anything at the moment - I'm just getting some of the scaffolding set up)

You'll have to do a few things to make this example work correctly. Here are some points:
Crypto.Hash.SHA256.SHA256Hash.update() (you invoke it as m.update()) has no return value. To pull a human-readable hash out of the object, .update() it as many times as you need and then call .hexdigest().
You don't need to encode the data before feeding it to .update(). You opened the file in 'rb' mode, so .read() already returns raw bytes; pass those straight in.
File pointers are advanced by file.read(). You don't need a separate .seek() operation.
.read() returns an empty bytes object once you've hit EOF. That's totally fine. Feel free just to pull in the partial last block.
Names are case-sensitive: block_size is not the same variable as BLOCK_SIZE, and the built-in is range(), not Range().
Making these few minor adjustments, and assuming you have all the right imports, you'll be on the right track.
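Putting those points together, a corrected version of your loop might look something like this (a sketch, assuming the same Crypto.Hash.SHA256 module you were already importing):

import os
from Crypto.Hash import SHA256  # PyCrypto / pycryptodome, as in your snippet

BLOCK_SIZE = 1024
m = SHA256.new()

with open('myfile', 'rb') as f:
    while True:
        b = f.read(BLOCK_SIZE)   # read() advances the file pointer; no seek() needed
        if not b:                # empty bytes object means EOF
            break
        m.update(b)              # pass the raw bytes; no .encode() required

print(m.hexdigest())             # human-readable hash after all the updates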

An alternative solution would be to break the file into blocks first and then hash it block by block.
This will break the file into chunks of 1024 bytes:
fList = []
with open(file, 'rb') as f:
    while True:
        chunk = f.read(1024)
        if chunk:
            fList.append(chunk)
        else:
            numBlocks = len(fList)
            break
Note: last block size may be less than 1024 bytes
Now you can compute the hash block by block in whichever way you want.
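For example, feeding the collected blocks to a hash object afterwards (a sketch, again assuming Crypto.Hash.SHA256 from the question):

from Crypto.Hash import SHA256

m = SHA256.new()
for chunk in fList:        # feed each block to the hash in order
    m.update(chunk)
print(m.hexdigest())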

Related

How to temporarily save data in Python?

I read position data from a GPS sensor into a dictionary, which I send to a server at a cyclic interval.
If I have no coverage, the data is saved in a list.
If the connection can be reestablished, all list items are transmitted.
But if a power interruption occurs, all temporary data elements are lost.
What would be the most pythonic solution to save this data?
I am using an SD card as storage, so I am not sure whether writing every element to a file would be the best solution.
Current implementation:
stageddata = []
position = {'lat': '1.2345', 'lon': '2.3455', 'timestamp': '2020-10-18T15:08:04'}

if not transmission(position):
    stageddata.append(position)
else:
    while stageddata:
        position = stageddata.pop()
        if not transmission(position):
            stageddata.append(position)
            return
EDIT: Finding the "best" solution may be very subjective. I agree with zvone that a power outage can be prevented; perhaps a shutdown routine should save the temporary data.
So the question may be: what is a pythonic way to save a given list to a file?
A good solution for temporary storage in Python is tempfile.
You can use it, e.g., like the following:
import tempfile

with tempfile.TemporaryFile() as fp:
    # Store your variable
    fp.write(your_variable_to_temp_store)
    # Do some other stuff
    # Read the file back
    fp.seek(0)
    fp.read()
I agree with the comment of zvone. In order to know the best solution, we would need more information.
The following would be a robust and configurable solution.
import os
import pickle

backup_interval = 2
backup_file = 'gps_position_backup.bin'

def read_backup_data():
    file_backup_data = []
    if os.path.exists(backup_file):
        with open(backup_file, 'rb') as f:
            while True:
                try:
                    coordinates = pickle.load(f)
                except EOFError:
                    break
                file_backup_data += coordinates
    return file_backup_data

# When the script is started and backup data exists, stageddata uses it
stageddata = read_backup_data()

def write_backup_data():
    tmp_backup_file = 'tmp_' + backup_file
    with open(tmp_backup_file, 'wb') as f:
        pickle.dump(stageddata, f)
    os.replace(tmp_backup_file, backup_file)
    print('Wrote data backup!')

# Mockup variable and method
transmission_return = False
def transmission(position):
    return transmission_return

def try_transmission(position):
    if not transmission(position):
        stageddata.append(position)
        if len(stageddata) % backup_interval == 0:
            write_backup_data()
    else:
        while stageddata:
            position = stageddata.pop()
            if not transmission(position):
                stageddata.append(position)
                return
            else:
                if len(stageddata) % backup_interval == 0:
                    write_backup_data()

if __name__ == '__main__':
    # transmission_return is False, so write to backup_file
    for counter in range(10):
        position = {'lat': '1.2345', 'lon': '2.3455'}
        try_transmission(position)

    # transmission_return is True, transmit positions and "update" backup_file
    transmission_return = True
    position = {'lat': '1.2345', 'lon': '2.3455'}
    try_transmission(position)
I moved your code into some functions. With the variable backup_interval, it is possible to control how often a backup is written to disk.
Additional Notes:
I use the built-in pickle module, since the data does not have to be human-readable or transferable to other programming languages. Alternatives are JSON, which is human-readable, or msgpack, which might be faster but needs an extra package to be installed. A tempfile is not a suitable solution here, as it cannot easily be retrieved if the program crashes.
stageddata is written to disk when it hits the backup_interval (obviously), but also when transmission returns True within the while loop. This is needed to "synchronize" the data on disk.
The data is written to disk completely new every time. A more sophisticated approach would be to just append the newly added positions, but then the synchronizing part, that I described before, would be more complicated too. Additionally, the safer temporary file approach (see Edit below) would not work.
Edit: I just reconsidered your use case. The main problem here is: Restoring data, even if the program gets interrupted at any time (due to power interruption or whatever). My first solution just wrote the data to disk (which solves part of the problem), but it could still happen, that the program crashes the moment when writing to disk. In that case the file would probably be corrupt and the data lost. I adapted the function write_backup_data(), so that it writes to a temporary file first and then replaces the old file. So now, even if a lot of data has to be written to disk and the crash happens there, the previous backup file would still be available.
Saving the data in a binary format could help minimize storage. The pickle and shelve modules will help with storing and serializing objects (to serialize an object means to convert its state to a byte stream so that the byte stream can later be reverted back into a copy of the object). You should be careful that, when you recover from the power interruption, you do not overwrite the data you have already stored; opening the file with open(file, "a") (a == append) avoids that.
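A minimal sketch of that append-style idea (the file name and helper names are placeholders; note that pickle writes bytes, so the file is opened in binary append mode 'ab' rather than plain 'a'):

import pickle

BACKUP_FILE = 'staged_positions.bin'  # placeholder file name

def stage(position):
    # Append each unsent position as its own pickle record,
    # so earlier records are never overwritten.
    with open(BACKUP_FILE, 'ab') as f:
        pickle.dump(position, f)

def load_staged():
    # Read back every record written so far (e.g. after a restart).
    records = []
    try:
        with open(BACKUP_FILE, 'rb') as f:
            while True:
                try:
                    records.append(pickle.load(f))
                except EOFError:
                    break
    except FileNotFoundError:
        pass
    return records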

How can I load a file with buffers in python?

hope you are having a great day!
In my recent ventures with Python 3.8.5 I have come across a dilemma I must say...
Being that I am a fairly new programmer I am afraid that I don't have the technical knowledge to load a single (BIG) file into the program.
To make my question more understandable, let's look at this below:
Let's say that there is a file on my system called "File.mp4" or "File.txt" (1GB in size);
I want to load this file into my program using the open function as rb;
I declared a buffer size of 1024;
This is the part I don't know how to solve
I load 1024 worth of bytes into the program
I do whatever I need to do with it
I then load another 1024 bytes in the place of the old buffer
Rinse and repeat until the whole file has been run through.
I looked at this question but either it is not good for my case or I just don't know how to implement it -> link to the question
This is the whole code you requested:
BUFFER = 1024

with open('file.txt', 'rb') as f:
    while (chunk := f.read(BUFFER)) != b'':   # compare against empty bytes, not ''
        print(list(chunk))
You can use buffered input from io with bytearray:
import io

buf = bytearray(1024)
with io.open(filename, 'rb') as fp:
    while True:
        size = fp.readinto(buf)
        if not size:
            break
        # do things with buf, considering the size
This is one of the situations that python 3.8's new walrus operator - which both assigns a value to a variable, and returns the value that it just assigned - is really good for. You can use file.read(size) to read in 1024-byte chunks, and simply stop when there's no more file left to read:
buffer_size = 1024

with open('file.txt', 'rb') as f:
    while (chunk := f.read(buffer_size)) != b'':
        ...  # do things with `chunk`, which holds up to 1024 bytes (the final chunk may be shorter)
Note that the != b'' part of the condition can safely be removed, as an empty bytes object evaluates to False when used as a boolean expression.

Is there any way to find the buffer size of a file object

I'm trying to "map" a very large ascii file. Basically I read lines until I find a certain tag and then I want to know the position of that tag so that I can seek to it again later to pull out the associated data.
from itertools import dropwhile

with open(datafile) as fin:
    ifin = dropwhile(lambda x: not x.startswith('Foo'), fin)
    header = next(ifin)
    position = fin.tell()
Now this tell doesn't give me the right position. This question has been asked in various forms before. The reason is presumably that python is buffering the file object, so python is telling me where its file pointer is, not where my file pointer is. I don't want to turn off this buffering ... the performance here is important. However, it would be nice to know if there is a way to determine how many bytes python chooses to buffer. In my actual application, as long as I'm close to the lines which start with Foo, it doesn't matter; I can drop a few lines here and there. So, what I'm actually planning on doing is something like:
position = fin.tell() - buffer_size(fin)
Is there any way to go about finding the buffer size?
To me, it looks like the buffer size is hard-coded in CPython to be 8192. As far as I can tell, there is no way to get this number from the Python interface other than to read a single line when you open the file, call f.tell() to figure out how much data Python actually read, and then seek back to the start of the file before continuing:
with open(datafile) as fin:
    next(fin)
    bufsize = fin.tell()
    fin.seek(0)
    ifin = dropwhile(lambda x: not x.startswith('Foo'), fin)
    header = next(ifin)
    position = fin.tell()
Of course, this fails in the event that the first line is longer than 8192 bytes, but that's not of any real consequence for my application.
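As a side note, on Python 3 the io module exposes the default buffer size as a constant, which may be enough if you only need the default value rather than the buffer size of a particular file object:

import io
print(io.DEFAULT_BUFFER_SIZE)  # 8192 on typical CPython builds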

Keep Track of Number of Bytes Read

I would like to implement a command-line progress bar in Python for one of my programs, which reads text from a file line by line.
I can implement the progress scale in one of two ways:
(number of lines / total lines) or
(number of bytes completed / bytes total)
I don't care which, but "number of lines" would seem to require me to loop through the entire document (which could be VERY large) just to get the value for "total lines".
This seems extremely inefficient. I was thinking outside the box and thought perhaps if I took the size of the file (easier to get?) and kept track of the number of bytes that have been read, it might make for a good progress bar metric.
I can use os.path.getsize(file) or os.stat(file).st_size to retrieve the size of the file, but I have not yet found a way to keep track of the number of bytes read by readline(). The files I am working with should be encoded in ASCII, or maybe even Unicode, so... should I just determine the encoding used and then record the number of characters read or use os.getsizeof() or some len() function for each line read?
I am sure there will be problems here. Any suggestions?
(P.S. - I don't think manually inputting the number of bytes to read at a time will work, because I need to work with each line individually; or else I will need to split it up afterwards by "\n"'s.)
bytesread = 0
while True:
    line = fh.readline()
    if line == '':
        break
    bytesread += len(line)
Or, a little shorter:
bytesread = 0
for line in fh:
    bytesread += len(line)
Using os.path.getsize() (or os.stat) is an efficient way of determining the file size.
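Putting the two together, a minimal progress sketch might look like this (the file name is a placeholder, and len(line) matches the byte count only for plain ASCII text with single-byte newlines):

import os

filename = 'data.txt'  # placeholder
total = os.path.getsize(filename)
bytesread = 0

with open(filename) as fh:
    for line in fh:
        bytesread += len(line)                     # character count; equals bytes for plain ASCII
        print('\r{:.1%}'.format(bytesread / total), end='')
print()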

Update strings in a text file at a specific location

I would like to find a better solution to achieve the following three steps:
read strings at a given row
update strings
write the updated strings back
Below is my code, which works, but I am wondering whether there is a better (simpler) solution.
new='99999'
f=open('C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP','r+')
lines=f.readlines()
#the row number we want to update is given, so just load the content
x = lines[95]
print(x)
f.close()
#replace
f1=open('C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP')
con = f1.read()
print(con)
con1 = con.replace(x[2:8], new)  # only certain columns in this row need to be updated
print(con1)
f1.close()
#write
f2 = open('C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP', 'w')
f2.write(con1)
f2.close()
Thanks!
UPDATE: I got an idea from jtmoulia, and this time it becomes easier:
def replace_line(file_name, line_num, col_s, col_e, text):
    lines = open(file_name, 'r').readlines()
    temp = lines[line_num]
    temp = temp.replace(temp[col_s:col_e], text)
    lines[line_num] = temp
    out = open(file_name, 'w')
    out.writelines(lines)
    out.close()
The problem with textual data, even when tabulated, is that the byte offsets are not predictable. For example, when representing numbers with strings you have one byte per digit, whereas in binary (e.g. two's complement) you always need the same four or eight bytes, whether the integer is small or large.
Nevertheless, if your text format is strict enough that you can get along by replacing bytes without changing the size of the file, you can try the standard mmap module. With it, you can treat a file as a mutable byte string and modify parts of it in place, letting the kernel do the file saving for you.
Otherwise, any of the other answers is better suited to the problem.
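A minimal sketch of that mmap idea, assuming the replacement text has exactly the same length as the bytes it overwrites (the offset and file name are placeholders):

import mmap

with open('MS1Ctt-P-temp.INP', 'r+b') as f:
    with mmap.mmap(f.fileno(), 0) as mm:
        offset = 2                        # placeholder byte offset within the file
        mm[offset:offset + 5] = b'99999'  # slice assignment must not change the length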
Well, to begin with you don't need to keep reopening and reading from the file every time. The r+ mode allows you to read and write to the given file.
Perhaps something like
with open('C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP', 'r+') as f:
    lines = f.readlines()
    # ... perform whatever replacement you'd like on lines
    f.seek(0)
    f.writelines(lines)
Also, Editing specific line in text file in python
When I had to do something similar (for a Webmin customization), I did it entirely in Perl because that's what the Webmin framework used, and I found it quite easy. I assume (but don't know for sure) there are equivalent things in Python. First, read the entire file into memory all at once (the Perl idiom for this is usually called "slurping"). (This idea of holding the entire file in memory rather than just one line used to make little sense, or even be impossible, but these days RAM is so large it's the only way to go.) Then use the split operator to divide the file into lines and put each line in a different element of a giant array. You can then use the desired line number as an index into the array (remember that array indices usually start with 0). Finally, use regular-expression processing to change the text of the line. Then change another line, and another, and another (or make another change to the same line). When you're all done, use join to put all the lines in the array back together into one giant string, then write the whole modified file out.
While I don't have the complete code handy, here's an approximate fragment of some of the Perl code so you can see what I mean:
our @filelines = ();
our $lineno = 43;
our $oldstring = 'foobar';
our $newstring = 'fee fie fo fum';

$filelines[$lineno-1] =~ s/$oldstring/$newstring/ig;
# "ig" modifiers for case-insensitivity and possible multiple occurrences in the line
# use different modifiers at the end of the s/// construct as needed
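For reference, a rough Python equivalent of that slurp / split / regex / join approach might look like this (using the standard re module; the line number and strings are the same hypothetical values as above):

import re

with open('MS1Ctt-P-temp.INP') as f:
    filelines = f.read().split('\n')   # slurp the whole file, then split into lines

lineno = 43
oldstring = 'foobar'
newstring = 'fee fie fo fum'

# Case-insensitive replacement of every occurrence on that one line
filelines[lineno - 1] = re.sub(oldstring, newstring,
                               filelines[lineno - 1], flags=re.IGNORECASE)

with open('MS1Ctt-P-temp.INP', 'w') as f:
    f.write('\n'.join(filelines))       # join the lines back together and write the file out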
FILENAME = 'C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP'

lines = list(open(FILENAME))
# Strings are immutable, so rebuild the line rather than assigning to a slice of it
lines[95] = lines[95][:2] + '99999' + lines[95][8:]
open(FILENAME, 'w').write(''.join(lines))
