I am trying to read data from this png image, and then place the image length at the start of the data, and pad it a given number of spaces defined by my header variable. However, once I do that, the image length increases drastically for a reason beyond my knowledge. Please can someone inform me of what is happening? I must be missing something since I am still fairly new to this field.
import os
HEADER = 10
PATH = os.path.abspath("penguin.png")
print(PATH)
with open(PATH,"rb") as f:
    imgbin = f.read()
print(len(imgbin))
imgbin = f"{len(imgbin):<{HEADER}}"+str(imgbin)
print(len(imgbin))
When I first print the length of the data, I get 163287; on the second print, I get 463797.
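This is because str(imgbin) does not keep the data as bytes: it produces the printable representation of the bytes object (b'...'), in which every non-printable byte is spelled out as an escape sequence such as \x89, so the result is much longer than the data itself: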
len(imgbin), len(str(imgbin))
>>> (189255, 545639)
(note I use a different image so the numbers are different). You can solve this issue by adding a binary string to the start like so:
with open(PATH,"rb") as f:
    imgbin = f.read()
imgbin = f"{len(imgbin):<{HEADER}}".encode('utf-8')+imgbin
print(len(imgbin))
>>> 189245
>>> 189255
You can find out more about binary strings here.
For reference, a file read in binary mode is just a sequence of bytes, i.e. unsigned 8-bit values (0-255), and Python's bytes type holds them as they are. They are not UTF-8 text, which is why only the header is encoded and the image data is left untouched. If you want to work with the decoded pixel values rather than the raw file bytes, something like numpy, which has uint8 as a data type, might be worth using.
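As a rough illustration of the fix above, here is a minimal sketch of adding a fixed-width length header to bytes and reading it back. The helper names add_length_header and split_length_header and the stand-in data fake_image are made up for this example; only HEADER comes from the question.

```python
HEADER = 10  # width of the fixed-size length field, as in the question


def add_length_header(payload: bytes, width: int = HEADER) -> bytes:
    """Prefix payload with its length as left-aligned, space-padded ASCII."""
    header = f"{len(payload):<{width}}".encode("ascii")
    if len(header) != width:
        raise ValueError("payload length does not fit in the header")
    return header + payload


def split_length_header(message: bytes, width: int = HEADER) -> bytes:
    """Read the length field and return exactly that many payload bytes."""
    length = int(message[:width].decode("ascii"))
    return message[width:width + length]


# Stand-in for the image bytes (hypothetical); includes non-printable values.
fake_image = bytes(range(256)) * 4
framed = add_length_header(fake_image)

print(len(fake_image), len(framed))               # 1024 1034
print(len(str(fake_image)) > len(fake_image))     # True: escapes inflate the text form
print(split_length_header(framed) == fake_image)  # True
```

The header stays pure ASCII, so the receiver can parse it with int() while the payload is never converted to text.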
Related
I have a binary file containing the position of 8000 particles.
I know that each particle value should look like "-24.6151...". I don't know the precision with which my program writes the values; I guess it is double precision.
But when I try to read the file with this code:
with open('.//results0epsilon/energybinary/energy_00004.dat', 'br') as f:
    buffer = f.read()
print ("Length of buffer is %d" % len(buffer))
for i in buffer:
    print(int(i))
I get as output:
Length of buffer is 64000
10
168
179
43
...
I skip the whole list of values but as you can see those values are far away from what I expect. I think I have some kind of decoding error.
I would appreciate any kind of help :)
What you are printing now are the bytes composing your floating point data. So it doesn't make sense as numerical values.
Of course, there's no 100% sure answer since we didn't see your data, but I'll try to guess:
You have 8000 values to read and the file size is 64000. So you probably have double IEEE values (8 bytes each). If it's not IEEE, then you're toast.
In that case you could try the following:
import struct
with open('.//results0epsilon/energybinary/energy_00004.dat', 'br') as f:
    buffer = f.read()
print ("Length of buffer is %d" % len(buffer))
data = struct.unpack("=8000d",buffer)
If the printed data looks bogus, it's probably an endianness problem, so replace =8000d with <8000d or >8000d.
For reference on the packing/unpacking formats: https://docs.python.org/3/library/struct.html
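A minimal sketch of the same idea that derives the number of values from the buffer size instead of hard-coding 8000. The function name read_doubles and the sample values are invented for illustration, and little-endian is only an assumed default.

```python
import struct


def read_doubles(buffer: bytes, byte_order: str = "<") -> tuple:
    """Interpret buffer as consecutive 8-byte IEEE 754 doubles."""
    item_size = struct.calcsize(byte_order + "d")  # 8 bytes per double
    if len(buffer) % item_size:
        raise ValueError("buffer length is not a multiple of 8 bytes")
    count = len(buffer) // item_size
    return struct.unpack(f"{byte_order}{count}d", buffer)


# Hypothetical sample data standing in for the particle file.
sample = struct.pack("<3d", -24.6151, 0.5, 1e10)
print(list(sample[:4]))      # raw byte values, meaningless on their own
print(read_doubles(sample))  # (-24.6151, 0.5, 10000000000.0)
```

Passing ">" as byte_order is the quick way to test the endianness guess.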
I'm implementing Huffman Algorithm, but when I got the final compressed code, I've got a string similar to below:
10001111010010101010101
This is the binary code built from the paths to my tree's leaves.
I have this sequence, but when I save it to a file, each 0 and 1 is written as an ASCII character, so the file is the same size as the original or bigger and nothing is actually compressed.
How do I save this binary directly?
PS: if I use some function to convert my string to binary, all I got is my ASCII converted to binary, so I did nothing, I need a real solution.
What you need to do is take the bits 8 at a time and convert each group into a byte to write out, looping until fewer than 8 bits remain. Then keep whatever is left over to prepend to the next value.
def binarize(bitstring):
    # Convert each complete group of 8 bits into one byte; return the leftover bits too.
    wholebytes = len(bitstring) // 8
    data = bytes(int(bitstring[i*8:i*8+8], 2) for i in range(wholebytes))
    remainder = bitstring[wholebytes*8:]
    return data, remainder
I think you just want int() with a base value of 2:
my_string = "10001111010010101010101"
code_num = int( my_string, 2 )
Then write to a binary file. struct.pack additionally allows you to specify whatever byte order you like.
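import struct
with open("filename.txt",'wb') as myfile:
    mybytes = struct.pack( 'i', code_num )
    myfile.write(mybytes)
This method will also write some number of leading zeros, which could cause trouble for your Huffman codes. Note too that 'i' is a fixed-size signed integer (typically 4 bytes), so a code longer than 31 bits will not fit.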
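One possible way around the leading-zeros and leftover-bits problem, as a minimal sketch: pad the bit string to a whole number of bytes and store the padding count in the first byte. The one-byte padding header is an assumption of this example, not part of any Huffman standard, and the names pack_bits and unpack_bits are made up.

```python
def pack_bits(bitstring: str) -> bytes:
    """Pack a string of '0'/'1' characters into bytes.

    The first output byte records how many zero bits were appended to
    fill the last byte, so the original length can be recovered.
    """
    padding = (-len(bitstring)) % 8
    padded = bitstring + "0" * padding
    body = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
    return bytes([padding]) + body


def unpack_bits(data: bytes) -> str:
    """Inverse of pack_bits: return the original '0'/'1' string."""
    padding = data[0]
    bits = "".join(f"{byte:08b}" for byte in data[1:])
    return bits[:len(bits) - padding]


code = "10001111010010101010101"    # 23 bits, from the question
packed = pack_bits(code)
print(len(code), len(packed))       # 23 4
print(unpack_bits(packed) == code)  # True
```

The result of pack_bits can be written directly to a file opened with 'wb'.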
I am aware that there are a lot of almost identical questions, but none seems to really target the general case.
So assume I want to open a file, read it in memory, possibly do some operations on the respective bitstring and write the result back to file.
The following is what seems straightforward to me, but it results in completely different output. Note that for simplicity I only copy the file here:
file = open('INPUT','rb')
data = file.read()
data_16 = data.encode('hex')
data_2 = bin(int(data_16,16))
OUT = open('OUTPUT','wb')
i = 0
while i < len(data_2) / 8:
    byte = int(data_2[i*8 : (i+1)*8], 2)
    OUT.write('%c' % byte)
    i += 1
OUT.close()
I looked at data, data_16 and data_2. The transformations make sense as far as I can see.
As expected, the output file has exactly the same size in bits as the input file.
EDIT: I considered the possibility that the leading '0b' has to be cut. See the following:
>>> data[:100]
'BMFU"\x00\x00\x00\x00\x006\x00\x00\x00(\x00\x00\x00\xe8\x03\x00\x00\xee\x02\x00\x00\x01\x00\x18\x00\x00\x00\x00\x00\x00\x00\x00\x00\x12\x0b\x00\x00\x12\x0b\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x05=o\xce\xf4^\x16\xe0\x80\x92\x00\x00\x00\x01I\x02\x1d\xb5\x81\xcaN\xcb\xb8\x91\xc3\xc6T\xef\xcb\xe1j\x06\xc3;\x0c*\xb9Q\xbc\xff\xf6\xff\xff\xf7\xed\xdf'
>>> data_16[:100]
'424d46552200000000003600000028000000e8030000ee020000010018000000000000000000120b0000120b000000000000'
>>> data_2[:100]
'0b10000100100110101000110010101010010001000000000000000000000000000000000000000000011011000000000000'
>>> data_2[1]
'b'
Maybe the BMFU" part should be cut from data?
>>> bin(25)
'0b11001'
Note two things:
The "0b" at the beginning. This means that your slicing will be off by 2 bits.
The lack of padding to 8 bits. This will corrupt your data every time unless it happens to mesh up with point 1.
Process the file byte by byte instead of attempting to process it in one big gulp like this. If you find your code too slow then you need to find a faster way of working byte by byte, not switch to an irreparably flawed method such as this one.
You could simply write the data variable back out and you'd have a successful round trip.
But it looks like you intend to work on the file as a string of 0 and 1 characters. Nothing wrong with that (though it's rarely necessary), but your code takes a very roundabout way of converting the data to that form. Instead of building a monster integer and converting it to a bit string, just do so for one byte at a time:
data = file.read()
data_2 = "".join( format(ord(c), '08b') for c in data )  # zero-padded to 8 bits per byte
data_2 is now a sequence of zeros and ones. (In a single string, same as you have it; but if you'll be making changes, I'd keep the bitstrings in a list). The reverse conversion is also best done byte by byte:
newdata = "".join(chr(int("".join(byte), 2)) for byte in grouper(long_bitstring, 8, "0"))
This uses the grouper recipe from the itertools documentation.
from itertools import izip_longest
def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)
You can use the struct module to read and write binary data. (Link to the doc here.)
EDIT
Sorry, I was misled by your title. I have just understood that you are writing binary data to a text file instead of writing binary data directly.
Ok, thanks to alexis and being aware of Ignacio's warning about the padding, I found a way to do what I wanted to do, that is read data into a binary representation and write a binary representation to file:
def padd(bitstring):
    padding = ''
    for i in range(8-len(bitstring)):
        padding += '0'
    bitstring = padding + bitstring
    return bitstring
file = open('INPUT','rb')
data = file.read()
data_2 = "".join( padd(bin(ord(c))[2:]) for c in data )
OUT = open('OUTPUT','wb')
i = 0
while i < len(data_2) / 8:
    byte = int(data_2[i*8 : (i+1)*8], 2)
    OUT.write('%c' % byte)
    i += 1
OUT.close()
If I did not do it exactly the way proposed by alexis then that is because it did not work. Of course this is terribly slow but now that I can do the simplest thing, I can optimize it further.
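The code in this thread is Python 2. For comparison, a minimal Python 3 sketch of the same round trip might look like the following; the function names bytes_to_bits and bits_to_bytes are invented here, and the sample bytes only imitate the start of the question's data.

```python
def bytes_to_bits(data: bytes) -> str:
    """Return data as a string of '0'/'1', eight characters per byte."""
    # Iterating over bytes yields ints in Python 3, so no ord() is needed.
    return "".join(f"{byte:08b}" for byte in data)


def bits_to_bytes(bits: str) -> bytes:
    """Inverse of bytes_to_bits; the length must be a multiple of 8."""
    if len(bits) % 8:
        raise ValueError("bit string length must be a multiple of 8")
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


original = b"BMFU\"\x00\x00\x05=o\xff"  # a few bytes in the style of the question's data
bits = bytes_to_bits(original)
print(bits[:16])                        # 0100001001001101
print(bits_to_bytes(bits) == original)  # True
```

The '08b' format does the zero padding that the padd helper performs by hand.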
I have written code for joining two wave files. It works fine when I am joining larger segments, but I need to join very small segments, and there the clarity is not good.
I have learned that the signal processing technique such a windowed join can be used to improve the joining of file.
y[n] = w[n]s[n]
Multiply value of signal at sample number n by the value of a windowing function
Hamming window: w[n] = 0.54 - 0.46*cos(2*pi*n/L), 0 <= n <= L
I do not understand how to get the value of the signal at sample n, or how to implement this.
The code I am using for joining is:
import wave
m=['C:/begpython/S0001_0002.wav', 'C:/begpython/S0001_0001.wav']
i=1
a=m[i]
infiles = [a, "C:/begpython/S0001_0002.wav", a]
outfile = "C:/begpython/S0001_00367.wav"
data= []
data1=[]
for infile in infiles:
    w = wave.open(infile, 'rb')
    data1=[w.getnframes]
    data.append( [w.getparams(), w.readframes(w.getnframes())] )
    #data1 = [ord(character) for character in data1]
    #print data1
    #data1 = ''.join(chr(character) for character in data1)
    w.close()
output = wave.open(outfile, 'wb')
output.setparams(data[0][0])
output.writeframes(data[0][1])
output.writeframes(data[1][1])
output.writeframes(data[2][1])
output.close()
During joining I am manipulating the frames in byte format. I guess I now have to use an integer or float format to perform operations on them. If that is true, how can I do this?
It's probably not the best solution, but I'm sure it will work. Maybe you can find existing libraries for some of the steps; I don't know what is available for Python. The steps I suggest are:
Load the wave file.
Create the sample values (amplitude) for each frame (depending on frame size, little/big endian, signed/unsigned).
Divide the resulting array of int values into windows, e.g. sample 0-511, 512-1023, ...
Perform the window function, for the windows that you want to join.
Do your joining.
Store the windows back in a byte array, the inverse operation of the first step.
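The steps above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: 16-bit signed little-endian mono PCM frames (as returned by readframes for such a file), one window spanning the whole segment, and a Hamming window normalised by length - 1. All function names are made up for this example.

```python
import math
import struct


def frames_to_samples(frames: bytes) -> list:
    """Decode 16-bit little-endian signed PCM frames into Python ints."""
    count = len(frames) // 2
    return list(struct.unpack(f"<{count}h", frames[:count * 2]))


def samples_to_frames(samples: list) -> bytes:
    """Encode numbers back into 16-bit little-endian signed PCM frames."""
    clipped = [max(-32768, min(32767, int(round(s)))) for s in samples]
    return struct.pack(f"<{len(clipped)}h", *clipped)


def hamming(length: int) -> list:
    """Hamming window w[n] = 0.54 - 0.46*cos(2*pi*n/(length-1))."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]


def apply_window(samples: list) -> list:
    """Compute y[n] = w[n] * s[n] over the whole segment."""
    return [w * s for w, s in zip(hamming(len(samples)), samples)]


# Hypothetical segment: a constant signal makes the window shape visible.
segment = samples_to_frames([10000] * 5)
windowed = apply_window(frames_to_samples(segment))
print([round(v) for v in windowed])  # [800, 5400, 10000, 5400, 800]

# Joining two windowed segments is then plain byte concatenation.
joined = samples_to_frames(windowed) + samples_to_frames(windowed)
```

The bytes in joined could be passed to writeframes on an output file with matching parameters.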
Old Post:
You have to calculate the sample value, in java a function for a 2 byte/frame soundfile would look like this:
public static int createIntFrom16( byte _8Bit1, byte _8Bit2 ) {
    return ( _8Bit1<<8 ) | ( _8Bit2 &0x00FF );
}
Normally you will have to care about whether or not the file uses little endian, I don't know if the Python lib will take this into account.
Once you have created all sample values, you have to divide your file into windows, e.g. of size 512 samples. Then you can window the values, and create back the byte values. For 16bit it would look like this:
public static byte[] createBytesFromInt(int i) {
    byte[] bytes = new byte[2];
    bytes[0]=(byte)(i>>8);
    bytes[1]=(byte)i;
    return bytes;
}
To give you a high-level understanding: a WAV file consists of a header (44 bytes in the simplest PCM layout) holding metadata such as sample rate, number of channels and bit depth, followed by the payload where the actual audio data lives. Audio is simply a curve of amplitude change over time, conventionally thought of as varying between +1.0 and -1.0. While a recording is made, this amplitude is measured many times per second, typically 44100 (the sample rate), and a WAV file just stores this series of sample values.
The common 16-bit PCM flavour does not store floating point values. Instead it maps the range -1.0 to +1.0 onto signed integers from -32768 to 32767, stored in little-endian order, so each sample needs two bytes of file storage per channel. In the example code above, i>>8 shifts the value right by 8 bits to extract the high byte. If you think about these ideas and write your own code to read and write the WAV format, you'll be well on your way to answering your question.
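To make the layout described above concrete, here is a minimal sketch using Python's standard wave module on an in-memory file. The four sample values are arbitrary, and 16-bit mono PCM is an assumption of the example.

```python
import io
import struct
import wave

# Build a tiny WAV file in memory so the example needs no file on disk.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)      # mono
    out.setsampwidth(2)      # 2 bytes = 16 bits per sample
    out.setframerate(44100)  # samples per second
    out.writeframes(struct.pack("<4h", 0, 16384, -16384, 32767))

wav_bytes = buffer.getvalue()
print(len(wav_bytes) - 8)               # 44: header size for this simple PCM file
print(wav_bytes[:4], wav_bytes[8:12])   # b'RIFF' b'WAVE'

# Read the payload back and decode it as signed 16-bit little-endian integers.
with wave.open(io.BytesIO(wav_bytes), "rb") as src:
    frames = src.readframes(src.getnframes())
    count = len(frames) // src.getsampwidth()
    samples = struct.unpack(f"<{count}h", frames)

print(samples)                          # (0, 16384, -16384, 32767)
print([s / 32768 for s in samples])     # [0.0, 0.5, -0.5, 0.999969482421875]
```

Dividing by 32768 gives the conventional -1.0 to +1.0 view of the same samples.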
I have practically no knowledge of Matlab, and need to translate some parsing routines into Python. They are for large files, that are themselves divided into 'blocks', and I'm having difficulty right from the off with the checksum at the top of the file.
What exactly is going on here in Matlab?
status = fseek(fid, 0, 'cof');
fposition = ftell(fid);
disp(' ');
disp(['** Block ',num2str(iBlock),' File Position = ',int2str(fposition)]);
% ----------------- Block Start ------------------ %
[A, count] = fread(fid, 3, 'uint32');
if(count == 3)
    magic_l = A(1);
    magic_h = A(2);
    block_length = A(3);
else
    if(fposition == file_length)
        disp(['** End of file OK']);
    else
        disp(['** Cannot read block start magic ! Note File Length = ',num2str(file_length)]);
    end
    ok = 0;
    break;
end
fid is the file currently being looked at
iBlock is a counter for which 'block' you're in within the file
magic_l and magic_h are to do with checksums later, here is the code for that (follows straight from the code above):
disp(sprintf(' Magic_L = %08X, Magic_H = %08X, Length = %i', magic_l, magic_h, block_length));
correct_magic_l = hex2dec('4D445254');
correct_magic_h = hex2dec('43494741');
if(magic_l ~= correct_magic_l | magic_h ~= correct_magic_h)
    disp(['** Bad block start magic !']);
    ok = 0;
    return;
end
remaining_length = block_length - 3*4 - 3*4; % We read Block Header, and we expect a footer
disp(sprintf(' Remaining Block bytes = %i', remaining_length));
What is going on with the %08X and the hex2dec stuff?
Also, why specify 3*4 instead of 12?
Really though, I want to know how to replicate [A, count] = fread(fid, 3, 'uint32'); in Python, as io.readline() is just pulling the first 3 characters of the file. Apologies if I'm missing the point somewhere here. It's just that using io.readline(3) on the file seems to return something it shouldn't, and I don't understand how the block_length can fit in a single byte when it could potentially be very long.
Thanks for reading this ramble. I hope you can understand kind of what I want to know! (Any insight at all is appreciated.)
Python Code for Reading a 1-Dimensional Array
When replacing Matlab with Python, I wanted to read binary data into a numpy.array, so I used numpy.fromfile to read the data into a 1-dimensional array:
import numpy as np
with open(inputfilename, 'rb') as fid:
    data_array = np.fromfile(fid, np.int16)
Some advantages of using numpy.fromfile versus other Python solutions include:
Not having to manually determine the number of items to be read. You can specify them using the count= argument, but it defaults to -1 which indicates reading the entire file.
Being able to specify either an open file object (as I did above with fid) or you can specify a filename. I prefer using an open file object, but if you wanted to use a filename, you could replace the two lines above with:
data_array = np.fromfile(inputfilename, np.int16)
Matlab Code for a 2-Dimensional Array
Matlab's fread has the ability to read the data into a matrix of form [m, n] instead of just reading it into a column vector. For instance, to read data into a matrix with 2 rows use:
fid = fopen(inputfilename, 'r');
data_array = fread(fid, [2, inf], 'int16');
fclose(fid);
Equivalent Python Code for a 2-Dimensional Array
You can handle this scenario in Python using Numpy's reshape and transpose.
import numpy as np
with open(inputfilename, 'rb') as fid:
    data_array = np.fromfile(fid, np.int16).reshape((-1, 2)).T
The -1 tells numpy.reshape to infer the length of the array for that dimension based on the other dimension—the equivalent of Matlab's inf infinity representation.
The .T transposes the array so that it is a 2-dimensional array with the first dimension—the axis—having a length of 2.
From the documentation of fread, it is a function to read binary data. The second argument specifies the size of the output vector, the third one the size/type of the items read.
In order to recreate this in Python, you can use the array module:
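import array
f = open(...)  # the file should be opened in binary mode ('rb')
a = array.array("I")  # "I" is unsigned int, 4 bytes on most platforms; "L" can be 8 bytes, so check a.itemsize
a.fromfile(f, 3)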
This will read three uint32 values from the file f, which are available in a afterwards. From the documentation of fromfile:
Read n items (as machine values) from the file object f and append them to the end of the array. If less than n items are available, EOFError is raised, but the items that were available are still inserted into the array. f must be a real built-in file object; something else with a read() method won’t do.
Arrays implement the sequence protocol and therefore support the same operations as lists, but you can also use the .tolist() method to create a normal list from the array.
Really though, I want to know how to replicate [A, count] = fread(fid, 3, 'uint32');
In Matlab, one of fread()'s signatures is fread(fileID, sizeA, precision). This reads in the first sizeA elements (not bytes) of a file, each of a size sufficient for precision. In this case, since you're reading in uint32, each element is of size 32 bits, or 4 bytes.
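So, instead, try read(12) on the file opened in binary mode to get the first three 4-byte elements. Avoid readline, which stops at the first newline byte, and note that the 12 bytes still have to be decoded into three integers.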
The first part is covered by Torsten's answer... you're going to need array or numpy (the successor of numarray) to do anything with this data anyway.
As for the %08X and the hex2dec stuff, %08X is just the print format for those uint32 numbers (8-digit, zero-padded hex, exactly the same as in Python), and hex2dec('4D445254') is Matlab for 0x4D445254.
Finally, ~= in Matlab means "not equal"; use != in Python (and or in place of |).
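Putting these pieces together, a minimal Python sketch of the block-start check could look like this. The function name read_block_start is invented, the in-memory fake_file and its length of 1000 are made-up test data, and little-endian byte order is an assumption that should be checked against the real files.

```python
import io
import struct

CORRECT_MAGIC_L = 0x4D445254  # same value as hex2dec('4D445254')
CORRECT_MAGIC_H = 0x43494741  # same value as hex2dec('43494741')


def read_block_start(fid, byte_order="<"):
    """Read three uint32 values; return (magic_l, magic_h, block_length).

    Returns None when fewer than 12 bytes are left, which corresponds to
    count being less than 3 in the Matlab code.
    """
    raw = fid.read(3 * 4)  # 3 values of 4 bytes each
    if len(raw) < 12:
        return None
    return struct.unpack(byte_order + "3I", raw)


# Hypothetical block header with a declared length of 1000 bytes.
fake_file = io.BytesIO(struct.pack("<3I", CORRECT_MAGIC_L, CORRECT_MAGIC_H, 1000))
magic_l, magic_h, block_length = read_block_start(fake_file)

print("Magic_L = %08X, Magic_H = %08X, Length = %i" % (magic_l, magic_h, block_length))
# Magic_L = 4D445254, Magic_H = 43494741, Length = 1000

if magic_l != CORRECT_MAGIC_L or magic_h != CORRECT_MAGIC_H:
    raise ValueError("bad block start magic")

remaining_length = block_length - 3 * 4 - 3 * 4  # header already read, footer expected
print(remaining_length)                           # 976
print(read_block_start(fake_file))                # None (end of file)
```

If the magic numbers come out byte-swapped on real data, passing ">" as byte_order is the first thing to try.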