export matlab variable to text for python usage - python

Let me start off by saying I'm a total beginner in MATLAB. I work in Python, and I've now received some data in a MATLAB file that I need to export to a format I can use with Python.
I've googled around and found I can export a MATLAB variable to a text file using:
dlmwrite('my_text', MyVariable, 'delimiter', ',');
The variable I need to export is a 16000 x 4000 matrix of doubles of the form 0.006747668446927. Here is where the problem starts: I need to export the full value of each double, but that function writes the numbers truncated to something like 0.0067477. This won't do, since I need a whole lot more precision for what I'm doing. So how can I export the full value of each of these numbers? Or, if you have a more elegant way of using that huge MATLAB matrix in Python, please feel free to share it.
Regards,
Bogdan

To exchange big chunks of numerical data between Python and MATLAB, I recommend HDF5:
http://en.wikipedia.org/wiki/Hierarchical_Data_Format
The Python binding is called h5py:
http://code.google.com/p/h5py
Here are examples for both directions. First, from MATLAB to Python:
% matlab
points = [1 2 3 ; 4 5 6 ; 7 8 9 ; 10 11 12 ];
hdf5write('test.h5', '/Points', points);
# python
import h5py
with h5py.File('test.h5', 'r') as f:
    points = f['/Points'][()]  # older h5py used f['/Points'].value, removed in h5py 3
And now from Python to MATLAB:
# python
import h5py
import numpy
points = numpy.array([ [1., 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12] ])
with h5py.File('test.h5', 'w') as f:
    f['/Points'] = points
% matlab
points = hdf5read('test.h5', '/Points');
NOTE: A column in MATLAB will come out as a row in Python and vice versa. This isn't a bug but the difference between the way C and Fortran interpret a contiguous piece of data in memory.
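If you want the orientation on the Python side to match MATLAB's, a transpose on read (inside the with block above) does it. This is my sketch, not part of the original answer:
# python
points = f['/Points'][()].T  # transpose back to MATLAB's row/column layout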

SciPy has tools for reading MATLAB .mat files natively; see e.g. http://www.janeriksolem.net/2009/05/reading-and-writing-mat-files-with.html
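A minimal sketch (assuming the .mat file stores a variable named MyVariable; scipy.io.loadmat returns a dict of NumPy arrays, so no precision is lost to text formatting):
# python
from scipy import io
mat = io.loadmat('my_data.mat')      # dict: variable name -> NumPy array
my_variable = mat['MyVariable']      # e.g. the 16000 x 4000 float64 matrix
io.savemat('my_data_out.mat', {'MyVariable': my_variable})  # and back again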

While I like the HDF5-based answer, I still think text files and CSVs are nice for smaller things (you can open them in text editors, spreadsheets, whatever). In that case I would use MATLAB's fopen/fprintf/fclose rather than dlmwrite, since I like to make things explicit; note that dlmwrite also accepts a 'precision' option (e.g. 'precision', '%.17g'), which fixes the truncation the asker ran into. Then again, dlmwrite might be better for multi-dimensional arrays.
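On the Python side, such a comma-delimited file then loads in one call (a sketch, assuming NumPy and the file name used in the question):
import numpy as np
my_variable = np.loadtxt('my_text', delimiter=',')  # 2-D float64 array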

You can simply write your variable to a file as binary data, then read it in any language you want, be it MATLAB, Python, C, etc. Example:
MATLAB (write)
X = rand([100 1],'single');
fid = fopen('file.bin', 'wb');
count = fwrite(fid, X, 'single');
fclose(fid);
MATLAB (read)
fid = fopen('file.bin', 'rb');
data = fread(fid, Inf, 'single=>single');
fclose(fid);
Python (read)
import struct

data = []
f = open("file.bin", "rb")
try:
    # read 4 bytes at a time (one single-precision float)
    bytes = f.read(4)  # returns a sequence of bytes as a string
    while bytes != "":
        # string byte-sequence to float
        num = struct.unpack('f', bytes)[0]
        # append to list
        data.append(num)
        # read next 4 bytes
        bytes = f.read(4)
finally:
    f.close()
# print list
print data
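If NumPy is available, the same read is a one-liner (a sketch, not part of the original answer):
import numpy as np
data = np.fromfile('file.bin', dtype=np.float32)  # all the single-precision floats at once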
C (read)
#include <stdio.h>
#include <stdlib.h>

int main()
{
    FILE *fp = fopen("file.bin", "rb");

    // Determine size of file
    fseek(fp, 0, SEEK_END);
    long int lsize = ftell(fp);
    rewind(fp);

    // Allocate memory, and read file
    float *numbers = (float*) malloc(lsize);
    size_t count = fread(numbers, 1, lsize, fp);
    fclose(fp);

    // print data
    int i;
    int numFloats = lsize / sizeof(float);
    for (i = 0; i < numFloats; i += 1) {
        printf("%g\n", numbers[i]);
    }

    free(numbers);  // release the buffer
    return 0;
}

Related

Python - Efficient way to flip bytes in a file?

I've got a folder full of very large files that need to be byte flipped by a power of 4. So essentially, I need to read the files as binary, adjust the sequence of bytes, and then write a new binary file with the bytes adjusted.
In essence, what I'm trying to do is read a hex string hexString that looks like this:
"00112233AABBCCDD"
And write a file that looks like this:
"33221100DDCCBBAA"
(i.e. every two hex characters are one byte, and I need to reverse the byte order within each group of four bytes)
I am very new to python and coding in general, and the way I am currently accomplishing this task is extremely inefficient. My code currently looks like this:
import binascii
with open(myFile, 'rb') as f:
    content = f.read()
hexString = str(binascii.hexlify(content))
flippedBytes = ""
inc = 0
while inc < len(hexString):
    flippedBytes += hexString[inc + 6:inc + 8]
    flippedBytes += hexString[inc + 4:inc + 6]
    flippedBytes += hexString[inc + 2:inc + 4]
    flippedBytes += hexString[inc:inc + 2]
    inc += 8
..... write the flippedBytes to file, etc
The code I pasted above accurately accomplishes what I need (note, my actual code has a few extra lines of "hexString.replace()" to remove unnecessary hex characters, but I've left those out to make the above easier to read). My ultimate problem is that it takes EXTREMELY long to run my code on larger files. Some of the files I need to flip are almost 2 GB in size, and the code was going to take almost half a day to complete one single file. I've got dozens of files I need to run this on, so that timeframe simply isn't practical.
Is there a more efficient way to flip the HEX values in a file by a power of 4?
.... for what it's worth, there is a tool called WinHEX that can do this manually, and only takes a minute max to flip the whole file.... I was just hoping to automate this with python so we didn't have to manually use WinHEX each time
You want to convert your 4-byte integers from little-endian to big-endian, or vice-versa. You can use the struct module for that:
import struct
with open(myfile, 'rb') as infile, open(myoutput, 'wb') as of:
    while True:
        d = infile.read(4)
        if not d:
            break
        le = struct.unpack('<I', d)
        be = struct.pack('>I', *le)
        of.write(be)
Here is a little struct awesomeness to get you started:
>>> import struct
>>> s = b'\x00\x11\x22\x33\xAA\xBB\xCC\xDD'
>>> a, b = struct.unpack('<II', s)
>>> s = struct.pack('>II', a, b)
>>> ''.join([format(x, '02x') for x in s])
'33221100ddccbbaa'
To do this at full speed for a large input, use struct.iter_unpack
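A sketch of that approach (Python 3.4+; assumes the file length is a multiple of 4 and small enough to read at once):
import struct

with open(myfile, 'rb') as infile, open(myoutput, 'wb') as of:
    data = infile.read()
    words = (v for (v,) in struct.iter_unpack('<I', data))    # little-endian words in
    of.write(struct.pack('>%dI' % (len(data) // 4), *words))  # big-endian words out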

Reading wav files - Native int16 values not doubles

Is it possible to extract the same values as Python does with its library?
For example, I'm using C++ and have managed to extract values of type double from a .wav file; however, when working in Python, the values are in int16 format. Here is what I have so far:
uint16_t c;
for (unsigned i = 0; i < size; i++)
{
    c = (unsigned)(data[i]);
    rawSignal.push_back(c);
}
This does not work, because when reading the wav file in Python:
w = wave.open(wavefile,"rb")
p = w.getparams()
s = w.readframes(p[3])
w.close()
sd = np.fromstring(s, np.int16)
I get the expected waveform displayed (plot omitted from this copy). However, using the C++ method above, I get a different, incorrect plot (also omitted).
NOTE: When using double values, it displays correctly. So is there an effective way to convert the double into the 'uint16_t' format?

Python Struct to Unpack like C fread using C structures?

I am struggling to port a snippet of code from C (originally Fortran) to python to unpack the header of a binary file.
The C is:
fread(&hdr, hdrSize, 1, modelFile);
alb = hdr.fd[0];
skrc = hdr.fd[2];
taud = hdr.fd[16];
djul = hdr.fd[40];
deljul = hdr.fd[41];
n4 = hdr.id[3];
n5 = hdr.id[4];
n24 = hdr.id[5];
jdisk = hdr.id[11];
k4out = hdr.id[16];
j5 = hdr.id[39];
The file has been written using Fortran where, for example, alb is defined as (F10.2) and n4 is defined as (I10).
I tried the struct module, numpy.fromfile(), and numpy.fromstring(). I know the total header size, so using numpy.fromstring() I can read the binary (opened with open(file, 'r+b')) and generate an array that is significantly longer than expected. This is backfilled with 0.0, which I believe makes sense if the header is a fixed size and the fill data is smaller.
Is the struct module the way to go to access the data in the same manner that fread() does? In the C code, I believe that the .fd[0] indicates that we are looking at structure 0? If so, is that format duplicatable in Python?
Edit:
I am not able to find .id or .fd in the C code, but the original fortran defines them as
REAL*4 FD(80)
INTEGER*4 ID(40)
The rest of the port from Fortran to C is essentially verbatim, so I would assume that the declarations are identical.
Edit2:
The following appears to be working - in that the variable values are logical. This is using the struct module.
import struct
import sys

with open(sys.argv[1], 'r+b') as l:
    headdata = l.read(HEAD_BYTES)

fd_arr = struct.unpack('f' * 80, headdata[0:320])
print fd_arr[0]
print fd_arr[2]
print fd_arr[16]
print fd_arr[40]
print fd_arr[41]
id_arr = struct.unpack('i' * 40, headdata[320:480])
print id_arr[3]
print id_arr[4]
print id_arr[5]
print id_arr[11]
print id_arr[16]
print id_arr[39]
I will explore using a single format string to see the impact of mixed types within the string.
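For reference, a single-format-string sketch of the same read (layout assumed from the code above; the '<' prefix fixes little-endian byte order and disables alignment padding):
import struct

values = struct.unpack('<80f40i', headdata[:480])  # 80 floats, then 40 ints
fd_arr, id_arr = values[:80], values[80:]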

reading struct in python from created struct in c

I am very new at using Python and very rusty with C, so I apologize in advance for how dumb and/or lost I sound.
I have a function in C that creates a .dat file containing data. I am opening the file using Python to read it. One of the things I need to read is a struct that was created in the C function and written in binary. In my Python code I am at the appropriate line of the file to read in the struct. I have tried unpacking the struct both item by item and as a whole, without success. Most of the items in the struct were declared 'real' in the C code; I am working on this code with someone else, the main source code is his, and he declared the variables as 'real'. I need to put this in a loop because I want to read all of the files in the directory that end in '.dat'. To start the loop I have:
for files in os.listdir(path):
    if files.endswith(".dat"):
        part = open(path + files, "rb")
        for line in part:
I then read all of the lines prior to the one containing the struct. When I get to that line, I have:
part_struct = part.readline()
r = struct.unpack('<d8', part_struct[0])
I'm trying to just read the first thing stored in the struct. I saw an example of this somewhere on here. And when I try this I'm getting an error that reads:
struct.error: repeat count given without format specifier
I will take any and all tips someone can give me. I have been stuck on this for a few days and have tried many different things. To be honest, I think I don't understand the struct module but I've read as much as I could on it.
Thanks!
You could use ctypes.Structure or struct.Struct to specify the format of the file. To read structures from the file produced by the C code in @perreal's answer:
"""
struct { double v; int t; char c;};
"""
from ctypes import *
class YourStruct(Structure):
    _fields_ = [('v', c_double),
                ('t', c_int),
                ('c', c_char)]

with open('c_structs.bin', 'rb') as file:
    result = []
    x = YourStruct()
    while file.readinto(x) == sizeof(x):
        result.append((x.v, x.t, x.c))
    print(result)
    # -> [(12.100000381469727, 17, 's'), (12.100000381469727, 17, 's'), ...]
See io.BufferedIOBase.readinto(). It is supported in Python 3 but it is undocumented in Python 2.7 for a default file object.
struct.Struct requires you to specify the padding bytes (x) explicitly:
"""
struct { double v; int t; char c;};
"""
from struct import Struct
x = Struct('dicxxx')
with open('c_structs.bin', 'rb') as file:
    result = []
    while True:
        buf = file.read(x.size)
        if len(buf) != x.size:
            break
        result.append(x.unpack_from(buf))
    print(result)
It produces the same output.
To avoid unnecessary copying, Array.from_buffer(mmap_file) can be used to get an array of structs from a file:
import mmap # Unix, Windows
from contextlib import closing
with open('c_structs.bin', 'rb') as file:
    with closing(mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_COPY)) as mm:
        result = (YourStruct * 3).from_buffer(mm)  # without copying
        print("\n".join(map("{0.v} {0.t} {0.c}".format, result)))
Some C code:
#include <stdio.h>
typedef struct { double v; int t; char c; } save_type;

int main() {
    save_type s = { 12.1f, 17, 's' };
    FILE *f = fopen("output", "wb");  /* "wb": write in binary mode */
    fwrite(&s, sizeof(save_type), 1, f);
    fwrite(&s, sizeof(save_type), 1, f);
    fwrite(&s, sizeof(save_type), 1, f);
    fclose(f);
    return 0;
}
Some Python code:
import struct
with open('output', 'rb') as f:
    chunk = f.read(16)
    while chunk != "":
        print len(chunk)
        print struct.unpack('dicccc', chunk)
        chunk = f.read(16)
Output:
(12.100000381469727, 17, 's', '\x00', '\x00', '\x00')
(12.100000381469727, 17, 's', '\x00', '\x00', '\x00')
(12.100000381469727, 17, 's', '\x00', '\x00', '\x00')
but there is also the padding issue. The padded size of save_type is 16, so we read 3 more characters and ignore them.
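To avoid hard-coding that 16, struct.calcsize can compute the padded size: ending a native-mode format with a count-0 'd' adds tail padding up to double alignment (a small sketch):
import struct

size = struct.calcsize('dic0d')  # 8 + 4 + 1 + 3 trailing pad bytes -> 16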
A number in the format specifier means a repeat count, but it has to go before the letter, like '<8d'. However, you said you just want to read one element of the struct, so I guess you just want '<d'. You may have been trying to specify the number of bytes to read as 8, but you don't need to: 'd' implies that.
I also noticed you are using readline. That seems wrong for reading binary data. It will read until the next carriage return / line feed, which will occur randomly in binary data. What you want to do is use read(size), like this:
part_struct = part.read(8)
r = struct.unpack('<d', part_struct)
Actually, you should be careful, as read can return less data than you request. You need to repeat it if it does.
part_struct = b''
while len(part_struct) < 8:
    data = part.read(8 - len(part_struct))
    if not data:
        raise IOError("unexpected end of file")
    part_struct += data
r = struct.unpack('<d', part_struct)
I had the same problem recently, so I made a module for the task, stored here: http://pastebin.com/XJyZMyHX
example code:
MY_STRUCT = """typedef struct __attribute__ ((__packed__)){
    uint8_t u8;
    uint16_t u16;
    uint32_t u32;
    uint64_t u64;
    int8_t i8;
    int16_t i16;
    int32_t i32;
    int64_t i64;
    long long int lli;
    float flt;
    double dbl;
    char string[12];
    uint64_t array[5];
} debugInfo;"""
PACKED_STRUCT='\x01\x00\x01\x00\x00\x01\x00\x00\x00\x00\x00\x01\x00\x00\x00\xff\x00\xff\x00\x00\xff\xff\x00\x00\x00\x00\xff\xff\xff\xff*\x00\x00\x00\x00\x00\x00\x00ff\x06#\x14\xaeG\xe1z\x14\x08#testString\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x05\x00\x00\x00\x00\x00\x00\x00'
if __name__ == '__main__':
    print "String:"
    print depack_bytearray_to_str(PACKED_STRUCT, MY_STRUCT, "<")
    print "Bytes in Struct:" + str(structSize(MY_STRUCT))
    nt = depack_bytearray_to_namedtuple(PACKED_STRUCT, MY_STRUCT, "<")
    print "Named tuple nt:"
    print nt
    print "nt.string=" + nt.string
The result should be:
String:
u8:1
u16:256
u32:65536
u64:4294967296
i8:-1
i16:-256
i32:-65536
i64:-4294967296
lli:42
flt:2.09999990463
dbl:3.01
string:u'testString\x00\x00'
array:(1, 2, 3, 4, 5)
Bytes in Struct:102
Named tuple nt:
CStruct(u8=1, u16=256, u32=65536, u64=4294967296L, i8=-1, i16=-256, i32=-65536, i64=-4294967296L, lli=42, flt=2.0999999046325684, dbl=3.01, string="u'testString\\x00\\x00'", array=(1, 2, 3, 4, 5))
nt.string=u'testString\x00\x00'
Numpy can be used to read/write binary data. You just need to define a custom np.dtype instance that defines the memory layout of your c-struct.
For example, here is some C++ code defining a struct (should work just as well for C structs, though I'm not a C expert):
#include <cstdint>
#include <fstream>
#include <string>

struct MyStruct {
    uint16_t FieldA;
    uint16_t pad16[3];
    uint32_t FieldB;
    uint32_t pad32[2];
    char FieldC[4];
    uint64_t FieldD;
    uint64_t FieldE;
};

void write_struct(const std::string& fname, MyStruct h) {
    // This function serializes a MyStruct instance and
    // writes the binary data to disk.
    std::ofstream ofp(fname, std::ios::out | std::ios::binary);
    ofp.write(reinterpret_cast<const char*>(&h), sizeof(h));
}
Based on the advice I found at stackoverflow.com/a/5397638, I've included some padding in the struct (the pad16 and pad32 fields) so that serialization will happen in a more predictable way. I think that this is a C++ thing; it might not be necessary when using plain ol' C structs.
Now, in python, we create a numpy.dtype object describing the memory-layout of MyStruct:
import numpy as np
my_struct_dtype = np.dtype([
    ("FieldA", np.uint16),
    ("pad16",  np.uint16, (3,)),
    ("FieldB", np.uint32),
    ("pad32",  np.uint32, (2,)),
    ("FieldC", np.byte,   (4,)),
    ("FieldD", np.uint64),
    ("FieldE", np.uint64),
])
Then use numpy's fromfile to read the binary file where you've saved your c-struct:
# read data
struct_data = np.fromfile(fpath, dtype=my_struct_dtype, count=1)[0]
FieldA = struct_data["FieldA"]
FieldB = struct_data["FieldB"]
FieldC = struct_data["FieldC"]
FieldD = struct_data["FieldD"]
FieldE = struct_data["FieldE"]
if FieldA != expected_value_A:
    raise ValueError("Bad FieldA, got %d" % FieldA)
if FieldB != expected_value_B:
    raise ValueError("Bad FieldB, got %d" % FieldB)
if FieldC.tobytes() != b"expc":
    raise ValueError("Bad FieldC, got %s" % FieldC.tobytes().decode())
...
The count=1 argument in the above call np.fromfile(..., count=1) is so that the returned array will have only one element; this means "read the first struct instance from the file". Note that I am indexing [0] to get that element out of the array.
If you have appended the data from many c-structs to the same file, you can use fromfile(..., count=n) to read n struct instances into a numpy array of shape (n,). Setting count=-1, which is the default for the np.fromfile and np.frombuffer functions, means "read all of the data", resulting in a 1-dimensional array of shape (number_of_struct_instances,).
You can also use the offset keyword argument to np.fromfile to control where in the file the data read will begin.
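For instance (a sketch of those variants; the offset keyword needs NumPy 1.17+):
all_structs = np.fromfile(fpath, dtype=my_struct_dtype)           # count=-1: read every instance
first_two = np.fromfile(fpath, dtype=my_struct_dtype, count=2)    # array of shape (2,)
second = np.fromfile(fpath, dtype=my_struct_dtype, count=1,
                     offset=my_struct_dtype.itemsize)[0]          # skip the first instance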
To conclude, here are some numpy functions that will be useful once your custom dtype has been defined:
Reading binary data as a numpy array:

np.frombuffer(bytes_data, dtype=...):
    Interpret the given binary data (e.g. a python bytes instance)
    as a numpy array of the given dtype. You can define a custom
    dtype that describes the memory layout of your c struct.

np.fromfile(filename, dtype=...):
    Read binary data from filename. Should be the same result as
    np.frombuffer(open(filename, "rb").read(), dtype=...).

Writing a numpy array as binary data:

ndarray.tobytes():
    Construct a python bytes instance containing the raw data from
    the given numpy array. If the array's data has a dtype
    corresponding to a c-struct, then the bytes coming from
    ndarray.tobytes() can be deserialized by c/c++ and interpreted
    as an (array of) instances of that c-struct.

ndarray.tofile(filename):
    Binary data from the array is written to filename.
    This data can then be deserialized by c/c++.
    Equivalent to open(filename, "wb").write(a.tobytes()).
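A quick roundtrip sketch tying these together (hypothetical data, reusing my_struct_dtype from above):
a = np.zeros(2, dtype=my_struct_dtype)           # two zeroed struct instances
a[0]["FieldA"] = 7
raw = a.tobytes()                                # serialize to raw bytes
b = np.frombuffer(raw, dtype=my_struct_dtype)    # reinterpret without copying
assert b[0]["FieldA"] == 7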

Python Fast Input Output Using Buffer Competitive Programming

I have seen people use buffered reads in various languages for fast input/output on online judges. For example, http://www.spoj.pl/problems/INTEST/ is solved in C like this:
#include <stdio.h>
#define size 50000

int main(void) {
    unsigned int n = 0, k, t;
    char buff[size];
    unsigned int divisible = 0;
    int block_read = 0;
    int j;
    t = 0;
    scanf("%u %u\n", &t, &k);
    while (t) {
        block_read = fread(buff, 1, size, stdin);
        for (j = 0; j < block_read; j++) {
            if (buff[j] == '\n') {
                t--;
                if (n % k == 0) {
                    divisible++;
                }
                n = 0;
            } else {
                n = n * 10 + (buff[j] - '0');
            }
        }
    }
    printf("%d", divisible);
    return 0;
}
How can this be done with python?
import sys

file = sys.stdin
size = 50000
t = 0
while t != 0:
    block_read = file.read(size)
    ...
    ...
Most probably this will not increase performance, though: Python is an interpreted language, so you basically want to spend as much time in native code (the standard library's input/parsing routines, in this case) as possible.
TL;DR: either use built-in routines to parse integers, or get some sort of third-party library that is optimized for speed.
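For illustration, a minimal bulk-read version in Python 3 (my own sketch, not from the original answer): read all of stdin in one go, then let the built-ins do the parsing.
import sys

data = sys.stdin.buffer.read().split()  # one buffered read, then split on whitespace
t, k = int(data[0]), int(data[1])
divisible = sum(1 for tok in data[2:2 + t] if int(tok) % k == 0)
sys.stdout.write(str(divisible))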
I tried solving this one in Python 3 and couldn't get it to work no matter how I tried reading the input. I then switched to running it under Python 2.5 so I could use
import psyco
psyco.full()
After making that change I was able to get it to work by simply reading input from sys.stdin one line at a time in a for loop. I read the first line using raw_input() and parsed the values of n and k, then used the following loop to read the remainder of the input.
for line in sys.stdin:
    count += not int(line) % k
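Filled out, the whole program looks something like this (a sketch in this answer's Python 2.5 style; variable names assumed):
import sys
import psyco
psyco.full()

n, k = map(int, raw_input().split())
count = 0
for line in sys.stdin:
    count += not int(line) % k
print count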
