Writing binary data to middle of a sparse file - python

I need to compile a binary file piece by piece, with the pieces arriving in random order (yes, it's a P2P project):
def write(filename, offset, data):
    file = open(filename, "ab")
    file.seek(offset)
    file.write(data)
    file.close()
Say I have a 32KB write(f, o, d) at offset 1MB into the file and then another 32KB write(f, o, d) at offset 0.
I end up with a file 65KB in length (i.e. the gap consisting of 0s between 32KB and 1MB is truncated/disappears).
I am aware this may appear an incredibly stupid question, but I cannot seem to figure it out from the open(..) modes.
Advice gratefully received.
*** UPDATE
My method to write P2P pieces ended up as follows (for those who may glean some value from it)
def writePiece(self, filename, pieceindex, bytes, ipsrc, ipdst, ts):
    file = open(filename, "r+b")
    if not self.piecemap[ipdst].has_key(pieceindex):
        little = struct.pack('<' + 'B' * len(bytes), *bytes)
        # Seek to offset based on piece index
        file.seek(pieceindex * self.piecesize)
        file.write(little)
        file.flush()
        self.procLog.info("Wrote (%d) bytes of piece (%d) to %s" % (len(bytes), pieceindex, filename))
        # Remember we have this piece now in case duplicates arrive
        self.piecemap[ipdst][pieceindex] = True
    file.close()
Note: I also addressed some endian issues using struct.pack which plagued me for a while.
For anyone wondering, the project I am working on is to analyse BT messages captured directly off the wire.

>>> import os
>>> filename = 'tempfile'
>>> def write(filename, data, offset):
...     try:
...         f = open(filename, 'r+b')
...     except IOError:
...         f = open(filename, 'wb')
...     f.seek(offset)
...     f.write(data)
...     f.close()
...
>>> write(filename,'1' * (1024*32),1024*1024)
>>> write(filename,'1' * (1024*32),0)
>>> os.path.getsize(filename)
1081344

You opened the file in append ("a") mode. All writes are going to the end of the file, irrespective of the calls to seek().

Try using 'r+b' rather than 'ab'.
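If the target file might not exist yet, another option (a minimal sketch, not code from the question) is to create it once at its final size and then write every piece through 'r+b':
def preallocate(filename, total_size):
    # create the file if it is missing, then extend it to its final length;
    # the extended region reads back as zeros and is stored sparsely on many filesystems
    open(filename, 'ab').close()
    with open(filename, 'r+b') as f:
        f.truncate(total_size)

def write_piece(filename, offset, data):
    # 'r+b' = read/write without truncating and without forcing writes to the end
    with open(filename, 'r+b') as f:
        f.seek(offset)
        f.write(data)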

It seems to me like there's not a lot of point in trying to assemble the file until all the pieces of it are there. Why not keep the pieces separate until all are present, then write them to the final file in order? That's what most P2P apps do, AFAIK.
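A sketch of that approach (hypothetical names, just to show the idea): keep each piece in its own file and concatenate them in index order once everything has arrived:
import os

def save_piece(piece_dir, pieceindex, data):
    # each piece lives in its own file until the download completes
    with open(os.path.join(piece_dir, '%08d.piece' % pieceindex), 'wb') as f:
        f.write(data)

def assemble(piece_dir, outname, piececount):
    # concatenate the pieces in order to build the final file
    with open(outname, 'wb') as out:
        for i in range(piececount):
            with open(os.path.join(piece_dir, '%08d.piece' % i), 'rb') as f:
                out.write(f.read())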


How can I load a file with buffers in python?

Hope you are having a great day!
In my recent ventures with Python 3.8.5 I have come across a dilemma I must say...
Being that I am a fairly new programmer I am afraid that I don't have the technical knowledge to load a single (BIG) file into the program.
To make my question much more understandable, let's look at this down below:
Lets say that there is a file on my system called "File.mp4" or "File.txt" (1GB in size);
I want to load this file into my program using the open function as rb;
I declared a buffer size of 1024;
This is the part I don't know how to solve
I load 1024 worth of bytes into the program
I do whatever I need to do with it
I then load another 1024 bytes in the place of the old buffer
Rinse and repeat until the whole file has been run through.
I looked at this question but either it is not good for my case or I just don't know how to implement it -> link to the question
This is the whole code you requested:
BUFFER = 1024
with open('file.txt', 'rb') as f:
    while (chunk := f.read(BUFFER)) != '':
        print(list(chunk))
You can use buffered input from io with bytearray:
import io

buf = bytearray(1024)
with io.open(filename, 'rb') as fp:
    while True:
        # readinto() fills the existing buffer in place and returns the
        # number of bytes actually read (0 at end of file)
        size = fp.readinto(buf)
        if not size:
            break
        # do things with buf, considering the size
This is one of the situations that Python 3.8's new walrus operator, which both assigns a value to a variable and returns the value it just assigned, is really good for. You can use file.read(size) to read in 1024-byte chunks, and simply stop when there's no more file left to read:
buffer_size = 1024
with open('file.txt', 'rb') as f:
    while (chunk := f.read(buffer_size)) != b'':
        # do things with the variable `chunk`; len(chunk) == buffer_size
        # for every chunk except possibly the last one
Note that the != b'' part of the condition can be safely removed, as an empty bytes object evaluates to False when used as a boolean expression.
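As a quick illustration of that note (the same loop, just without the explicit comparison):
buffer_size = 1024
with open('file.txt', 'rb') as f:
    while chunk := f.read(buffer_size):
        # an empty bytes object is falsy, so the loop stops at end of file
        print(len(chunk))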

Chunking with Python

Hi and happy holidays to everyone!
I have to cope with big csv files (around 5GB each) on a simple laptop, so I am learning to read files in chunks (I am a complete noob at this), using Python 2.7 in particular. I found this very nice example:
# chunked file reading
from __future__ import division
import os

def get_chunks(file_size):
    chunk_start = 0
    chunk_size = 0x20000  # 131072 bytes, default max ssl buffer size
    while chunk_start + chunk_size < file_size:
        yield(chunk_start, chunk_size)
        chunk_start += chunk_size
    final_chunk_size = file_size - chunk_start
    yield(chunk_start, final_chunk_size)

def read_file_chunked(file_path):
    with open(file_path) as file_:
        file_size = os.path.getsize(file_path)
        print('File size: {}'.format(file_size))
        progress = 0
        for chunk_start, chunk_size in get_chunks(file_size):
            file_chunk = file_.read(chunk_size)
            # do something with the chunk, encrypt it, write to another file...
            progress += len(file_chunk)
            print('{0} of {1} bytes read ({2}%)'.format(
                progress, file_size, int(progress / file_size * 100))
            )

if __name__ == '__main__':
    read_file_chunked('some-file.gif')
(source: https://gist.github.com/richardasaurus/21d4b970a202d2fffa9c)
but something is still not very clear to me. For example, let's say that I write a piece of code and I want to test it on a small fraction of my dataset, just to check if it runs properly. How could I read, let's say, only the first 10% of my csv file and run my code on that chunk, without having to store the rest of the dataset in memory?
I appreciate any hint - even some reading or external reference is good, if related to chunking files with python. Thank you!
Let's consider the following CSV file. If you open it with Notepad or any simple text editor, you can see this:
CU-C2376;Airbus A380;50.00;259.00
J2-THZ;Boeing 737;233.00;213.00
SU-XBG;Embraer ERJ-195;356.00;189.00
TI-GGH;Boeing 737;39.00;277.00
HK-6754J;Airbus A380;92.00;93.00
6Y-VBU;Embraer ERJ-195;215.00;340.00
9N-ABU;Embraer ERJ-195;151.00;66.00
YV-HUI;Airbus A380;337.00;77.00
If you observe carefully, each line corresponds to one row and each value is separated with a ";".
Let's say I want to read only the first three rows, then:
with open('data.csv') as f:
    lines = list()
    for i in range(3):
        lines.append(f.readline())
    # Do some stuff with the first three lines
This is a better way of reading a chunk of the file, because if the file is, say, 10MB and you simply read the first 3MB by byte count, the last bytes you read may fall in the middle of a row and not represent anything on their own.
Alternatively, you can use libraries like pandas.
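For example, here is a minimal sketch with pandas (assuming the semicolon-separated data.csv from above and that pandas is installed): read_csv can limit how much is loaded with nrows, or stream the file in chunks with chunksize:
import pandas as pd

# quick test run: load only the first 1000 rows
sample = pd.read_csv('data.csv', sep=';', header=None, nrows=1000)

# full run: stream the file in chunks of 10000 rows at a time
for chunk in pd.read_csv('data.csv', sep=';', header=None, chunksize=10000):
    process(chunk)  # process() is a hypothetical stand-in for your own code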

How can I truncate an mp3 audio file by 30%?

I am trying to truncate an audio file by 30%: if the audio file was 4 minutes long, after truncating it, it should be around 72 seconds. I have written the code below to do it, but it only returns a 0 byte file. Please tell me where I went wrong.
def loadFile():
    with open('music.mp3', 'rb') as in_file:
        data = len(in_file.read())
        with open('output.mp3', 'wb') as out_file:
            ndata = newBytes(data)
            out_file.write(in_file.read()[:ndata])

def newBytes(bytes):
    newLength = (bytes/100) * 30
    return int(newLength)

loadFile()
You are trying to read your file a second time, which will return no data, because the first read inside len(in_file.read()) has already consumed it. Instead, read the whole file into a variable once and then calculate the length of that; the variable can then be used a second time.
def newBytes(bytes):
    # integer division keeps the result usable as a slice index
    return (bytes * 70) // 100

def loadFile():
    with open('music.mp3', 'rb') as in_file:
        data = in_file.read()
        with open('output.mp3', 'wb') as out_file:
            ndata = newBytes(len(data))
            out_file.write(data[:ndata])
Also it is better to multiply first and then divide to avoid having to work with floating point numbers.
You cannot reliably truncate an MP3 file by byte size and expect it to be equivalently truncated in audio time length.
MP3 frames can change bitrate. While your method will sort of work, it won't be all that accurate. Additionally, you'll undoubtedly break frames leaving glitches at the end of the file. You will also lose ID3v1 tags (if you still use them... better to use ID3v2 anyway).
Consider executing FFmpeg with -acodec copy instead. This will simply copy the bytes over while maintaining the integrity of the file, and ensuring a good clean cut where you want it to be.
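For illustration, a call along these lines (a sketch, assuming ffmpeg is on the PATH and that the target length, 72 seconds here, has already been worked out, e.g. with ffprobe) cuts the file without re-encoding:
import subprocess

# keep only the first 72 seconds; -acodec copy copies the audio stream without re-encoding
subprocess.run(
    ['ffmpeg', '-i', 'music.mp3', '-t', '72', '-acodec', 'copy', 'output.mp3'],
    check=True,
)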

Getting hash (digest) of a file in Python - reading whole file at once vs reading line by line

I need to get a hash (digest) of a file in Python.
Generally, when processing any file content it is advised to process it gradually, line by line, due to memory concerns, yet I need the whole file to be loaded in order to obtain its digest.
Currently I'm obtaining hash in this way:
import hashlib

def get_hash(f_path, mode='md5'):
    h = hashlib.new(mode)
    with open(f_path, 'rb') as file:
        data = file.read()
        h.update(data)
        digest = h.hexdigest()
        return digest
Is there any other way to perform this in more optimized or cleaner manner?
Is there any significant improvement in reading file gradually line by line over reading whole file at once when still the whole file must be loaded to calculate the hash?
According to the documentation for hash.update(), you don't need to concern yourself with the block size of different hashing algorithms. However, I'd test that a bit. It seems to check out: the snippet below reads in 512-byte chunks (MD5's internal block size is 512 bits), and if you change that read size to anything else, the results are the same as reading it all in at once.
import hashlib

def get_hash(f_path, mode='md5'):
    h = hashlib.new(mode)
    with open(f_path, 'rb') as file:
        data = file.read()
        h.update(data)
        digest = h.hexdigest()
        return digest

def get_hash_memory_optimized(f_path, mode='md5'):
    h = hashlib.new(mode)
    with open(f_path, 'rb') as file:
        block = file.read(512)
        while block:
            h.update(block)
            block = file.read(512)
    return h.hexdigest()

digest = get_hash('large_bin_file')
print(digest)
digest = get_hash_memory_optimized('large_bin_file')
print(digest)
> bcf32baa9b05ca3573bf568964f34164
> bcf32baa9b05ca3573bf568964f34164
Of course you can load the data in chunks, so that memory usage drops significantly, as you no longer have to load the whole file. Then you call hash.update(chunk) for each chunk:
import hashlib
from functools import partial

Hash = hashlib.new("sha1")
size = 128  # just an example
with open("data.txt", "rb") as f:
    # read successive chunks of `size` bytes until b'' (end of file) is returned
    for chunk in iter(partial(f.read, size), b''):
        Hash.update(chunk)
I find this iter trick very neat because it allows you to write much cleaner code. It may look confusing at first, so I'll explain how it works:
iter(function, sentinel) executes function successively and yields the values it returns until one of them is equal to sentinel.
partial(f.read, size) returns a callable version of f.read(size). This is oversimplified, but still correct in this case.
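As a toy illustration of that behaviour (not part of the original answer):
from functools import partial
from io import BytesIO

buf = BytesIO(b'abcdefgh')
# calls buf.read(3) repeatedly and stops as soon as it returns the sentinel b''
print(list(iter(partial(buf.read, 3), b'')))  # [b'abc', b'def', b'gh']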
You get the same result with both snippets:
h = hashlib.new("md5")
with open(filename, "rb") as f:
    for line in f:
        h.update(line)
print(h.hexdigest())
and
h = hashlib.new("md5")
with open(filename, "rb") as f:
    h.update(f.read())
print(h.hexdigest())
A few notes:
the first approach works best with big text files, memory-wise. With a binary file, there's no such thing as a "line". It will work, though, but a "chunk" approach is more regular (not going to paraphrase other answers)
the second approach eats a lot of memory if the file is big
in both cases, make sure that you open the file in binary mode, or end-of-line conversion could lead to a wrong checksum (external tools would compute a different MD5 than your program)

segmenting and writing binary file using Python

I have two binary input files, firstfile and secondfile. secondfile is firstfile + additional material. I want to isolate this additional material in a separate file, newfile. This is what I have so far:
import os
import struct
origbytes = os.path.getsize(firstfile)
fullbytes = os.path.getsize(secondfile)
numbytes = fullbytes-origbytes
with open(secondfile,'rb') as f:
    first = f.read(origbytes)
    rest = f.read()
Naturally, my inclination is to do (which seems to work):
with open(newfile,'wb') as f:
    f.write(rest)
I can't find it but thought I read on SO that I should pack this first using struct.pack before writing to file. The following gives me an error:
with open(newfile,'wb') as f:
    f.write(struct.pack('%%%ds' % numbytes, rest))
-----> error: bad char in struct format
This works however:
with open(newfile,'wb') as f:
    f.write(struct.pack('c'*numbytes, *rest))
And for the ones that work, this gives me the right answer
with open(newfile,'rb') as f:
    test = f.read()
len(test) == numbytes
-----> True
Is this the correct way to write a binary file? I just want to make sure I'm doing this part correctly to diagnose if the second part of the file is corrupted as another reader program I am feeding newfile to is telling me, or I am doing this wrong. Thank you.
If you know that secondfile is the same as firstfile + appended data, why even read in the first part of secondfile?
with open(secondfile,'rb') as f:
    f.seek(origbytes)
    rest = f.read()
As for writing things out,
with open(newfile,'wb') as f:
    f.write(rest)
is just fine. The stuff with struct would just be a no-op anyway. The only thing you might consider is the size of rest. If it could be large, you may want to read and write the data in blocks.
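A block-by-block copy could look like this (a minimal sketch, reusing the origbytes, secondfile and newfile names from the question):
BLOCK_SIZE = 1024 * 1024  # 1 MiB at a time

with open(secondfile, 'rb') as src, open(newfile, 'wb') as dst:
    src.seek(origbytes)  # skip the part that is already in firstfile
    while True:
        block = src.read(BLOCK_SIZE)
        if not block:
            break
        dst.write(block)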
There is no reason to use the struct module, which is for converting between binary formats and Python objects. There's no conversion needed here.
Strings in Python 2.x are just an array of bytes and can be read and written to and from files. (In Python 3.x, the read function returns a bytes object, which is the same thing, if you open the file with open(filename, 'rb').)
So you can just read the file into a string, then write it again:
import os

origbytes = os.path.getsize(firstfile)
fullbytes = os.path.getsize(secondfile)
numbytes = fullbytes - origbytes

with open(secondfile,'rb') as f:
    first = f.seek(origbytes)
    rest = f.read()

with open(newfile,'wb') as f:
    f.write(rest)
You don't need to read the first origbytes bytes, just move the file pointer to the right position: f.seek(origbytes)
You don't need struct packing; just write rest to the new file.
This is not C, and there is no % in a struct format string. What you want is:
f.write(struct.pack('%ds' % numbytes, rest))
It worked for me:
>>> struct.pack('%ds' % 5,'abcde')
'abcde'
Explanation: '%%%ds' % 15 is '%15s', while what you want is '%ds' % 15 which is '15s'
