Read binary file: Matlab differs from R and Python

I am trying to recreate a Matlab script that reads a binary file in either R or Python. Here is a link to the test.bin data: https://github.com/AndrewHFarkas/BinaryReadTest
Matlab
FilePath = 'test.bin'
fid=fopen(FilePath,'r','b');
VersionStr=fread(fid,7,'char')
Version=fread(fid,1,'int16')
SizeFormat='float32'
DataFormat='float32'
EegMegStatus=fread(fid,1,SizeFormat)
NChanExtra=fread(fid,1,SizeFormat)
TrigPoint=fread(fid,1,'float32')
DataTypeVal=fread(fid,1,'float32')
TmpSize=fread(fid,2,SizeFormat)
AvgMat=fread(fid,1,DataFormat)
Matlab output:
VersionStr =
86
101
114
115
105
111
110
Version =
8
SizeFormat =
'float32'
DataFormat =
'float32'
EegMegStatus =
1
NChanExtra =
0
TrigPoint =
1
DataTypeVal =
0
TmpSize =
65
1076
AvgMat =
-12.9650
This is my closest attempt in Python (I adapted some of this code from a different Stack Overflow answer):
import numpy as np
import array

def fread(fid, nelements, dtype):
    if dtype is np.str:
        dt = np.uint8  # WARNING: assuming 8-bit ASCII for np.str!
    else:
        dt = dtype
    data_array = np.fromfile(fid, dt, nelements)
    data_array.shape = (nelements, 1)
    return data_array

fid = open('test.bin', 'rb')
print(fread(fid, 7, np.str))  # so far so good!
[[ 86]
[101]
[114]
[115]
[105]
[111]
[110]]
# both of these options return 2048
print(fread(fid, 1, np.int16))
np.fromfile(fid, np.int16, 1)
No matter what else I've tried, I can't get any of the same numbers past that point. I have tried little- and big-endian settings, but maybe not correctly.
If it helps, here is my closest attempt in R:
newdata = file("test.bin", "rb")
version_char = readBin(newdata, "character", n=1)
version_char
[1] "Version" # this makes sense because the first 7 bytes to spell Version
version_num = readBin(newdata, "int", size = 1 , n = 1, endian = "little")
version_num
[1] 8 #correct number
And nothing after that matches. This is where I get really confused: I was only able to get 8 with a size = 1 (one byte) for version_num, but an int16 should be two bytes as far as I understand. I have tried the code below to read in a float, as suggested in another post:
readBin(newdata, "double", size = 4 , n = 1, endian = "little")
Thank you all for your time

struct is in general what you would use to unpack binary data:
>>> keys = "VStr vNum EegMeg nChan trigPoint dType tmpSize[0] tmpSize[1] avgMat".split()
>>> dict(zip(keys, struct.unpack_from(">7sh7f", open("test.bin", "rb").read())))
{'nChan': 0.0, 'EegMeg': 1.0, 'dType': 0.0, 'trigPoint': 1.0, 'VStr': 'Version', 'avgMat': -12.964995384216309, 'vNum': 8, 'tmpSize[0]': 65.0, 'tmpSize[1]': 1076.0}
The format string says: big-endian ('>', matching MATLAB's fopen(FilePath,'r','b')), a string of 7 characters, followed by a short int, followed by 7 float32 values.
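The same read can be done with NumPy by making the byte order explicit in every dtype: MATLAB's fopen(FilePath,'r','b') opens the file big-endian, which is what the little-endian defaults in the question were missing. A minimal sketch against synthetic bytes laid out like the start of test.bin (the real file may contain more data after these fields):

```python
import struct
import numpy as np

# Synthetic bytes mimicking the start of test.bin: b"Version", a big-endian
# int16, then seven big-endian float32 values (fopen(...,'b') means big-endian).
raw = b'Version' + struct.pack('>h', 8) + struct.pack('>7f', 1, 0, 1, 0, 65, 1076, -12.965)

version_str = raw[:7].decode('ascii')                            # 'Version'
version = np.frombuffer(raw, dtype='>i2', count=1, offset=7)[0]  # 8, not 2048
floats = np.frombuffer(raw, dtype='>f4', count=7)[0:7] if False else np.frombuffer(raw, dtype='>f4', count=7, offset=9)  # EegMeg ... AvgMat
```

Reading b'\x00\x08' with the default little-endian int16 gives 0x0800 = 2048, which is exactly the wrong value the question reports. With a real file, np.fromfile accepts the same '>i2' / '>f4' dtypes.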


Little to big endian buffer at once python [duplicate]

I've created a buffer of words represented in little endian(Assuming each word is 2 bytes):
A000B000FF0A
I've separated the buffer into 3 words (2 bytes each):
A000
B000
FF0A
and after that converted to big endian representation:
00A0
00B0
0AFF
Is there a way instead of split into words to represent the buffer in big endian at once?
Code:
def endian(num):
    p = '{{:0{}X}}'.format(4)
    hex_str = p.format(num)
    b = bytearray.fromhex(hex_str)
    b.reverse()  # reverse() works in place and returns None
    swapped = ''.join(format(x, '02x') for x in b)
    return int(swapped, 16)

buffer = 'A000B000FF0A'
for i in range(0, len(buffer), 4):
    value = endian(int(buffer[i:i + 4], 16))
Using the struct or array libraries is probably the easiest way to do this.
Converting the hex string to bytes first is needed.
Here is an example of how it could be done:
from array import array
import struct
hex_str = 'A000B000FF0A'
raw_data = bytes.fromhex(hex_str)
print("orig string: ", hex_str.casefold())
# With array lib
arr = array('h')
arr.frombytes(raw_data)
# arr = array('h', [160, 176, 2815])
arr.byteswap()
array_str = arr.tobytes().hex()
print(f"Swap using array: ", array_str)
# With struct lib
arr2 = [x[0] for x in struct.iter_unpack('<h', raw_data)]
# arr2 = [160, 176, 2815]
struct_str = struct.pack(f'>{len(arr2) * "h"}', *arr2).hex()
print("Swap using struct:", struct_str)
Gives:
orig string: a000b000ff0a
Swap using array: 00a000b00aff
Swap using struct: 00a000b00aff
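If NumPy is acceptable, the whole-buffer swap can also be sketched without any explicit loop (assuming, as in the question, 2-byte words):

```python
import numpy as np

hex_str = 'A000B000FF0A'
# Interpret the buffer as little-endian 16-bit words, then write out the
# byte-swapped memory in one shot.
words = np.frombuffer(bytes.fromhex(hex_str), dtype='<u2')
swapped_hex = words.byteswap().tobytes().hex()  # '00a000b00aff'
```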
You can use the struct to interpret your bytes as big or little endian. Then you can use the hex() method of the bytearray object to have a nice string representation.
Docs for struct.
import struct
# little endian
a = struct.pack("<HHH",0xA000,0xB000,0xFF0A)
# big endian
b = struct.pack(">HHH",0xA000,0xB000,0xFF0A)
print(a)
print(b)
# convert back to string
print( a.hex() )
print( b.hex() )
Which gives:
b'\x00\xa0\x00\xb0\n\xff'
b'\xa0\x00\xb0\x00\xff\n'
00a000b00aff
a000b000ff0a

Define a BitStruct inside a BitStruct in python using construct package

I am trying to read the following format (Intel -> Little-Endian):
X: 0 -> 31, size 32 bits
Offset: 32 -> 43, size 12 bits
Index: 44 -> 47, size 4 bits
Time: 48 -> 55, size 8 bits
Radius: 56 -> 63, size 8 bits
For this parser I defined:
from construct import BitStruct, BitsInteger, Bytewise
from construct import Int32sl, Int8ul

BitStruct(
    "X" / Bytewise(Int32sl),
    "Offset" / BitsInteger(12),
    "Index" / BitsInteger(4),
    "Time" / Bytewise(Int8ul),
    "Radius" / Bytewise(Int8ul),
)
from the following bytes:
bytearray(b'\xca\x11\x01\x00\x00\x07\xffu')
What I get is:
Container:
X = 70090
Offset = 0
Index = 7
Time = 255
Radius = 117
What I should have gotten is:
Container:
X = 70090
Offset = 1792
Index = 0
Time = 255
Radius = 117
As you can see, the values of Offset and Index that I get do not match with the expected values, the rest is correct.
From what I saw, I need to swap the two bytes which contain the Offset and Index values.
How could I define a struct inside a struct and swap the two bytes as well?
BitsInteger treats input as big-endian by default. From the documentation on BitsInteger:
Note that little-endianness is only defined for multiples of 8 bits.
You must set the parameter swapped to True:
swapped – bool, whether to swap byte order (little endian), default is False (big endian)
As such:
BitStruct( "X" / Bytewise(Int32sl),
"Offset" / BitsInteger(12, swapped=True),
"Index" / BitsInteger(4, swapped=True),
"Time" / Bytewise(Int8ul),
"Radius" / Bytewise(Int8ul),
)
BUT you are not using multiples of 8 bits, so you should just reverse the initial byte array and be done with it.
By reversing the bytes in the bytearray and the order of the fields in the BitStruct, I am able to get the correct values.
from construct import BitStruct, BitsInteger, Bytewise
from construct import Int32sb, Int8ul

data = bytearray(b'\xca\x11\x01\x00\x00\x07\xffu')
data_reverse = data[::-1]

format = BitStruct(
    "Radius" / Bytewise(Int8ul),
    "Time" / Bytewise(Int8ul),
    "Index" / BitsInteger(4),
    "Offset" / BitsInteger(12),
    "X" / Bytewise(Int32sb),
)
print(format.parse(data_reverse))
This returns:
Container:
Radius = 117
Time = 255
Index = 0
Offset = 1792
X = 70090
If someone has a better solution, I would be more than happy to hear it.
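For this particular 8-byte layout, the standard-library struct module may be all that is needed; a sketch (variable names are mine) that unpacks the little-endian fields and splits the packed 16-bit word with plain bit operations:

```python
import struct

data = b'\xca\x11\x01\x00\x00\x07\xffu'

# '<iHBB': little-endian int32 (X), uint16 (Offset and Index packed together),
# then two uint8 values (Time, Radius).
x, packed, time_, radius = struct.unpack('<iHBB', data)
offset = packed & 0x0FFF  # low 12 bits
index = packed >> 12      # high 4 bits
# x=70090, offset=1792, index=0, time_=255, radius=117
```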

Audio data string format to numpy array

I am trying to convert the sample rate (from 44100 to 22050) of a numpy.array with 88200 samples which I have already processed (added silence, converted to mono). I tried to convert this array with audioop.ratecv and it worked, but it returns a str instead of a numpy array, and when I wrote that data with scipy.io.wavfile.write, half of the data was lost and the audio played twice as fast (instead of slower, which would at least have made some sense).
audioop.ratecv works fine with the str data that wave.open returns, but I don't know how to process those, so I tried to convert from numpy to str with numpy.array2string(data), pass that to ratecv to get correct results, and then convert back to numpy with numpy.fromstring(data, dtype); now the len of the data is 8 samples. I think this is due to a mix-up of formats, but I don't know how to control it. I also haven't figured out what kind of str format wave.open returns, so I can't force that format either.
Here is this part of my code
def conv_sr(data, srold, fixSR, dType, chan = 1):
    state = None
    width = 2  # numpy.int16
    print "data shape", data.shape, type(data[0])  # returns shape 88200, type int16
    fragments = numpy.array2string(data)
    print "new fragments len", len(fragments), "type", type(fragments)  # returns len 30, type str
    fragments_new, state = audioop.ratecv(fragments, width, chan, srold, fixSR, state)
    print "fragments", len(fragments_new), type(fragments_new[0])  # returns 16, type str
    data_to_return = numpy.fromstring(fragments_new, dtype=dType)
    return data_to_return
and I call it like this
data1 = numpy.array(data1, dtype=dType)
data_to_copy = numpy.append(data1, data2)
data_to_copy = data_to_copy.sum(axis = 1) / chan
data_to_copy = data_to_copy.flatten() # because its mono
data_to_copy = conv_sr(data_to_copy, sr, fixSR, dType) #sr = 44100, fixSR = 22050
scipy.io.wavfile.write(filename, fixSR, data_to_copy)
After a bit more research I found my mistake: it seems that 16-bit audio samples are made of two 8-bit 'cells', so the dtype I was using was wrong, and that's why I had the audio speed issue. I found the correct dtype here. So, in conv_sr, I pass a numpy array, convert it to a data string, convert the sample rate, convert back to a numpy array for scipy.io.wavfile.write, and finally convert the two 8-bit cells to 16-bit format:
def widthFinder(dType):
    try:
        b = str(dType)
        bits = int(b[-2:])
    except ValueError:
        b = str(dType)
        bits = int(b[-1:])
    width = bits / 8
    return width

def conv_sr(data, srold, fixSR, dType, chan = 1):
    state = None
    width = widthFinder(dType)
    if width != 1 and width != 2 and width != 4:
        width = 2
    fragments = data.tobytes()
    fragments_new, state = audioop.ratecv(fragments, width, chan, srold, fixSR, state)
    fragments_dtype = numpy.dtype((numpy.int16, {'x': (numpy.int8, 0), 'y': (numpy.int8, 1)}))
    data_to_return = numpy.fromstring(fragments_new, dtype=fragments_dtype)
    data_to_return = data_to_return.astype(dType)
    return data_to_return
If you find anything wrong, please feel free to correct me, I am still a learner
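One small update for Python 3 readers: numpy.fromstring is deprecated there, and the str/bytes round-trip the answer relies on can be sketched with tobytes and frombuffer instead (the sample values here are made up):

```python
import numpy as np

samples = np.array([1000, -2000, 32767], dtype=np.int16)
raw = samples.tobytes()                     # the byte string audioop.ratecv expects
back = np.frombuffer(raw, dtype=np.int16)   # modern replacement for numpy.fromstring
```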

Efficiently unpack mono12packed bitstring format with python

I have raw data from a camera, which is in the mono12packed format. This is an interlaced bit format, to store 2 12bit integers in 3 bytes to eliminate overhead. Explicitly the memory layout for each 3 bytes looks like this:
Byte 1 = Pixel0 Bits 11-4
Byte 2 = Pixel1 Bits 3-0 + Pixel0 Bits 3-0
Byte 3 = Pixel1 Bits 11-4
I have a file, where all the bytes can be read from using binary read, let's assume it is called binfile.
To get the pixeldata from the file I do:
from bitstring import BitArray as Bit

f = open(binfile, 'rb')
bytestring = f.read()
f.close()

a = []
for i in range(len(bytestring) // 3):  # reading 2 pixels = 3 bytes at a time
    s = Bit(bytes=bytestring[i * 3:i * 3 + 3], length=24)
    p0 = s[0:8] + s[12:16]
    p1 = s[16:] + s[8:12]
    a.append(p0.unpack('uint:12'))
    a.append(p1.unpack('uint:12'))
which works, but is horribly slow and I would like to do that more efficiently, because I have to do that for a huge amount of data.
My idea is, that by reading more than 3 bytes at a time I could spare some time in the conversion step, but I can't figure a way how to do that.
Another idea is, since the bits come in packs of 4, maybe there is a way to work on nibbles rather than on bits.
Data example:
The bytes
'\x07\x85\x07\x05\x9d\x06'
lead to the data
[117, 120, 93, 105]
Have you tried bitwise operators? Maybe that's a faster way:
with open('binfile.txt', 'rb') as binfile:
    bytestring = list(bytearray(binfile.read()))

a = []
for i in range(0, len(bytestring), 3):
    px_bytes = bytestring[i:i + 3]
    p0 = (px_bytes[0] << 4) | (px_bytes[1] & 0x0F)
    p1 = (px_bytes[2] << 4) | (px_bytes[1] >> 4 & 0x0F)
    a.append(p0)
    a.append(p1)
print a
This also outputs:
[117, 120, 93, 105]
Hope it helps!
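The loop above can also be vectorized with NumPy, which is usually much faster on large frames; a sketch assuming the same mono12packed layout and using the question's example bytes:

```python
import numpy as np

raw = np.frombuffer(b'\x07\x85\x07\x05\x9d\x06', dtype=np.uint8).astype(np.uint16)
b0, b1, b2 = raw[0::3], raw[1::3], raw[2::3]

# Same bit layout as the loop: p0 = byte0 plus the low nibble of byte1,
# p1 = byte2 plus the high nibble of byte1.
p0 = (b0 << 4) | (b1 & 0x0F)
p1 = (b2 << 4) | (b1 >> 4)

# Interleave p0/p1 back into pixel order.
pixels = np.empty(p0.size * 2, dtype=np.uint16)
pixels[0::2] = p0
pixels[1::2] = p1
# pixels -> [117, 120, 93, 105]
```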

Can you use numpy's fromfile to store files in an ndarray?

I'm still pretty new to python. I've been able to write a program that will read in a file from binary and stores the data that's there in a few arrays. Now that I've been able to complete a few other tasks with this program, I'm trying to go back through all my code and see where I can make it more efficient, learning Python better along the way. In particular, I'm trying to update the reading and storing of data from the file. Using numpy's fromfile is MUCH, MUCH faster at unpacking data than the struct.unpack method, and works wonderfully for a 1D array structure. However, I have some of the data stored in 2D arrays. I am seemingly stuck on how to implement the same type of storing in the 2D array. Does anyone have any ideas or hints as to how I may be able to perform this?
My basic program structure is as follows:
from numpy import fromfile
import numpy as np
file = open(theFilePath,'rb')
####### File Header #########
reservedParse = 4
fileHeaderBytes = 4 + int(131072/reservedParse) #Parsing out the bins in the file header
fileHeaderArray = np.zeros(fileHeaderBytes)
fileHeaderArray[0] = fromfile(file, dtype='<I', count=1) #File Index; 4 Bytes
fileHeaderArray[1] = fromfile(file, dtype='<I', count=1) #The Number of Packets; 4 bytes
fileHeaderArray[2] = fromfile(file, dtype='<Q', count=1) #Timestamp; 16 bytes; 2, 8-byte.
fileHeaderArray[3] = fromfile(file, dtype='<Q', count=1)
fileHeaderArray[4:] = fromfile(file, dtype='<I', count=int(131072/reservedParse)) #Empty header space
####### Data Packets #########
#Note: Each packet begins with a header containing information about the data stream followed by the data.
packets = int(fileHeaderArray[1]) #The number of packets in the data stream
dataLength = int(28672)
packHeader = np.zeros(14*packets).reshape((packets,14))
data = np.zeros(dataLength*packets).reshape((packets,dataLength))
for i in range(packets):
    packHeader[i][0] = fromfile(file, dtype='>H', count=1) #Model Num
    packHeader[i][1] = fromfile(file, dtype='>H', count=1) #Packet ID
    packHeader[i][2] = fromfile(file, dtype='>I', count=1) #Coarse Timestamp
    ....#Continuing on
    packHeader[i][13] = fromfile(file, dtype='>i', count=1) #4 bytes of reserved space
    data[i] = fromfile(file, dtype='<h', count=dataLength) #Actual data
Essentially this is what I have right now. Is there a way I can do this without doing the loop? Going through that loop does not seem particularly fast or numpy-ish.
For reference, the for-loop structure using unpack and not numpy is:
packHeader = [[0 for x in range(14)] for y in range(packets)]
data = [[0 for x in range(dataLength)] for y in range(packets)]
for i in range(packets):
    packHeader[i][0] = unpack('>H', file.read(2)) #Model Num
    packHeader[i][1] = unpack('>H', file.read(2)) #Packet ID
    packHeader[i][2] = unpack('>I', file.read(4)) #Coarse Timestamp
    ....#Continuing on
    packHeader[i][13] = unpack('>i', file.read(4)) #4 bytes of reserved space
    packHeader[i] = list(chain(*packHeader[i])) #Deals with the tuple issue ((x,),(y,),...) -> (x,y,...)
    data[i] = [unpack('<h', file.read(2)) for j in range(dataLength)] #Actual data
    data[i] = list(chain(*data[i])) #Deals with the tuple issue ((x,),(y,),...) -> (x,y,...)
Let me know if any clarification is needed.
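One way to drop the Python loop entirely (a sketch, not verified against the real file format): describe one whole packet as a NumPy structured dtype, mixing byte orders per field, and let a single fromfile call read every packet at once. The field names here are illustrative and only the first three of the 14 header fields are shown:

```python
import struct
import numpy as np

data_length = 4  # stands in for the real dataLength (28672)
packet_dtype = np.dtype([
    ('model',  '>u2'),                   # big-endian unsigned short
    ('packid', '>u2'),
    ('coarse', '>u4'),                   # big-endian unsigned int
    ('data',   '<i2', (data_length,)),   # little-endian samples, one row per packet
])

# Two fake packets to demonstrate; with a real file this would be
# np.fromfile(file, dtype=packet_dtype, count=packets).
raw = b''.join(
    struct.pack('>HHI', 1, pid, 42) + struct.pack('<4h', 10, 20, 30, 40)
    for pid in (0, 1)
)
packets_arr = np.frombuffer(raw, dtype=packet_dtype)
data = packets_arr['data']               # shape (packets, data_length)
```

packets_arr['data'] is then the (packets, dataLength) 2D array the question builds by hand, with no Python-level loop, and the header fields come out as 1D arrays per field.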
