scipy.io.loadmat reads MATLAB (R2016a) structs incorrectly

Instead of loading a MATLAB struct as a dict (as described in http://docs.scipy.org/doc/scipy/reference/tutorial/io.html and other related questions), scipy.io.loadmat is loading it as a strange ndarray, where the values are an array of arrays, and the field names are taken to be the dtype. Minimal example:
(MATLAB):
>> a = struct('b',0)
a =
b: 0
>> save('simple_struct.mat','a')
(Python):
In[1]:
import scipy.io as sio
matfile = sio.loadmat('simple_struct.mat')
a = matfile['a']
a
Out[1]:
array([[([[0]],)]],
dtype=[('b', 'O')])
This problem persists in Python 2 and 3.

This is expected behavior. NumPy is just showing you how MATLAB stores your data under the hood.
MATLAB structs are 2+D cell arrays where one dimension is mapped to a sequence of strings. In NumPy, the equivalent data structure is a structured array (often called a "record array"), and the dtype is used to store the field names. And since MATLAB matrices must be at least 2D, the 0 you stored in MATLAB is really a 2D matrix with dimensions (1, 1).
So what you are seeing in the scipy.io.loadmat output is how MATLAB stores your data (minus the dtype bit; MATLAB doesn't have such a thing). Specifically, it is a 2D [1, 1] object array (NumPy's counterpart of a cell array), where one dimension is mapped to a string, containing a [1, 1] 2D array. MATLAB hides some of these details from you, but NumPy doesn't.
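To make the nesting concrete, here is a minimal sketch that rebuilds the array from Out[1] by hand with plain NumPy and peels off each layer. No .mat file is involved, so the hand-built array is only an assumed stand-in for what loadmat returned:

```python
import numpy as np

# Hand-built stand-in for matfile['a']: a (1, 1) structured array
# with a single object field named 'b'.
a = np.empty((1, 1), dtype=[('b', 'O')])
a['b'][0, 0] = np.array([[0]])  # MATLAB's scalar 0 is really a (1, 1) matrix

field_names = a.dtype.names   # the struct's field names live in the dtype
b_matrix = a['b'][0, 0]       # index the (1, 1) struct array to get the (1, 1) matrix
b_value = b_matrix[0, 0]      # finally, the scalar itself

print(field_names, b_matrix.shape, b_value)
```

If you would rather not index through the singleton dimensions, scipy.io.loadmat also accepts the squeeze_me=True and struct_as_record=False keyword arguments, which squeeze unit dimensions and load structs as objects with attribute access.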

Related

Different decimal formats within same numpy array

I reshaped a 3D NumPy array to 2D using the reshape method by X1 = np.reshape(input,(500, 3*40)). Now the new 2D array has different formats such as,
few rows have the following format -
X1[8,:] has -
array([ 5557., 2001., 1434., 1348., 991., 1240., 1668., 1093.,
1680., 1476., 2521., 1841., 2443., 2295., 1911., 2491., and so on .... ])
whereas few other rows have the following format -
X1[9,:] has -
array([3.69900e+04, 1.19090e+04, 1.12300e+04, 1.25170e+04, 6.91000e+03,
7.24700e+03, 8.31800e+03, 6.31000e+03, 8.96700e+03, 7.18100e+03,
1.03010e+04, 9.69800e+03, 1.29270e+04, 1.33140e+04, 1.00420e+04, and so on ... ])
Since they don't have the same format throughout, I am not sure if it will cause a problem during neural network model training. I am not sure how to maintain the same decimal format throughout the same NumPy array.

That isn't a problem for you, because 5557. and 1.03010e+04 are both floats. The second format (scientific notation) only affects how the numbers are shown (printed).
Remember that a numpy array has just one data type for all of its items; you can get it with the array.dtype attribute.
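A quick way to convince yourself is the sketch below. The values are a few entries copied from the two rows in the question; the array itself is an illustrative stand-in, not the asker's data:

```python
import numpy as np

# A few illustrative values taken from the two rows in the question.
X1 = np.array([[5557.0, 2001.0, 1434.0],
               [3.699e4, 1.1909e4, 1.123e4]])

# Every row shares the dtype of the array it belongs to.
same_dtype = X1[0].dtype == X1[1].dtype
print(X1.dtype, same_dtype)

# Optional: ask NumPy not to use scientific notation when printing.
# This changes only the text representation, not the stored values.
with np.printoptions(suppress=True):
    print(X1[1])
```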

What is the fastest way to read in an image to an array of tuples?

I am trying to assign provinces to an area for use in a game mod. I have two separate maps for area and provinces.
provinces file,
area file.
Currently I am reading in an image in Python and storing it in an array using PIL like this:
from PIL import Image
import numpy as np
land_prov_pic = Image.open(INPUT_FILES_DIR + land_prov_str)
land_prov_array = np.array(land_prov_pic)
image_size = land_prov_pic.size
for x in range(image_size[0]):
    if x % 100 == 0:
        print(x)  # progress indicator
    for y in range(image_size[1]):
        land_prov_array[x][y] = land_prov_pic.getpixel((x,y))
Where you end up with land_prov_array[x][y] = (R,G,B)
However, this gets really slow, especially for large images. I tried reading it in using opencv like this:
import cv2
land_prov_array = cv2.imread(INPUT_FILES_DIR + land_prov_str)
land_prov_array = cv2.cvtColor(land_prov_array, cv2.COLOR_BGR2RGB) #Convert from BGR to RGB
But now land_prov_array[x][y] = [R G B] which is an ndarray and can't be inserted into a set. But it's way faster than the previous for loop. How do I convert [R G B] to (R,G,B) for every element in the array without for loops or, better yet, read it in that way?
EDIT: Added pictures, more description, and code blocks for readability.

It is best to convert the [R,G,B] array to a tuple when you need it to be a tuple, rather than converting the whole image to this form. An array of tuples takes up a lot more memory, and will be a lot slower to process, than a numeric array.
The answer by isCzech shows how to create a NumPy view over a 3D array that presents the data as if it were a 2D array of tuples. This might not require the additional memory of an actual array of tuples, but it is still a lot slower to process.
Most importantly, most NumPy functions (such as np.mean) and operators (such as +) cannot be applied to such an array. Thus, one is obliged to iterate over the array in Python code (or with an np.vectorize function), which is a lot less efficient than using NumPy functions and operators that work on the array as a whole.
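As a rough sketch of "convert only when needed", the snippet below keeps the image numeric and builds tuples at the point of use. The tiny 2 x 3 image is made up for illustration, and using np.unique to collect distinct colours is one possible approach, not something the question prescribes:

```python
import numpy as np

# Tiny stand-in for the loaded image: shape (height, width, 3), dtype uint8.
land_prov_array = np.zeros((2, 3, 3), dtype=np.uint8)
land_prov_array[0, 0] = (255, 0, 0)
land_prov_array[1, 2] = (0, 128, 64)

# Convert a single pixel to a hashable tuple only at the point of use.
pixel = tuple(land_prov_array[0, 0].tolist())

# Collect the distinct colours with one vectorised call, then build tuples
# from the (usually much smaller) list of unique rows.
unique_rows = np.unique(land_prov_array.reshape(-1, 3), axis=0)
province_colours = {tuple(row) for row in unique_rows.tolist()}

print(pixel, sorted(province_colours))
```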

For transformation from a 3D array (data3D) to a 2D array (data2D), I've used this approach:
import numpy as np
dt = np.dtype([('x', 'u1'), ('y', 'u1'), ('z', 'u1')])
data2D = data3D.view(dtype=dt).squeeze()
The .view call reinterprets the data type and still returns a 3D array, with a last dimension of size 1 that can then be removed by .squeeze. Alternatively, you can use .squeeze(axis=-1) to squeeze only the last dimension (in case some of your other dimensions are of size 1 too).
Please note I've used uint8 ('u1') - your type may be different.
Trying to do this using a loop is very slow, indeed (compared to this approach at least).
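Here is a small self-contained version of the same idea, assuming a C-contiguous uint8 array (the sample data3D below is invented for the demonstration):

```python
import numpy as np

# Stand-in for data3D: a C-contiguous (rows, cols, 3) uint8 array.
data3D = np.arange(24, dtype=np.uint8).reshape(2, 4, 3)

dt = np.dtype([('x', 'u1'), ('y', 'u1'), ('z', 'u1')])
data2D = data3D.view(dtype=dt).squeeze(axis=-1)  # shape (2, 4), no copy made

first_channel = data2D['x']      # each field gives back one channel plane
as_tuple = data2D[0, 1].item()   # one element as a plain Python tuple

print(data2D.shape, as_tuple)
```

Because the result is a view, writing to data2D also changes data3D.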
Similar question here: Show a 2d numpy array where contents are tuples as an image

Matlab to Python numpy indexing and multiplication issue

I have the following line of code in MATLAB which I am trying to convert to Python numpy:
pred = traindata(:,2:257)*beta;
In Python, I have:
pred = traindata[ : , 1:257]*beta
beta is a 256 x 1 array.
In MATLAB,
size(pred) = 1389 x 1
But in Python,
pred.shape = (1389L, 256L)
So, I found out that multiplying by the beta array is producing the difference between the two arrays.
How do I write the original Python line, so that the size of pred is 1389 x 1, like it is in MATLAB when I multiply by my beta array?

I suspect that beta is in fact a 1D numpy array. In numpy, 1D arrays are not row or column vectors, whereas MATLAB clearly makes this distinction. They are simply 1D arrays with no orientation. If you need a column vector, you must manually introduce a new singleton dimension to beta to facilitate the multiplication. On top of this, the * operator performs element-wise multiplication. To perform matrix-vector or matrix-matrix multiplication, you must use numpy's dot function.
Therefore, you must do something like this:
import numpy as np # Just in case
pred = np.dot(traindata[:, 1:257], beta[:,None])
beta[:,None] will create a 2D numpy array where the elements from the 1D array are populated along the rows, effectively making a column vector (i.e. 256 x 1). However, if you have already done this on beta, then you don't need to introduce the new singleton dimension. Just use dot normally:
pred = np.dot(traindata[:, 1:257], beta)
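The shapes involved can be checked with random stand-in data. The array sizes follow the question; the values and the 257-column width of traindata are assumptions made only so the slice 1:257 is valid:

```python
import numpy as np

rng = np.random.default_rng(0)
traindata = rng.standard_normal((1389, 257))  # assumed width: 257 columns
beta = rng.standard_normal(256)               # 1D array, as suspected above

# '*' broadcasts beta across the rows: element-wise, shape (1389, 256).
elementwise = traindata[:, 1:257] * beta

# Matrix product with an explicit column vector: shape (1389, 1), as in MATLAB.
pred = np.dot(traindata[:, 1:257], beta[:, None])

# With a 1D beta, the @ operator gives the same numbers as a 1D result.
pred_1d = traindata[:, 1:257] @ beta

print(elementwise.shape, pred.shape, pred_1d.shape)
```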

Read .mat file in Python. But the shape of the data changed

% save .mat file in the matlab
train_set_x=1:50*1*51*61*23;
train_set_x=reshape(train_set_x,[50,1,51,61,23]);
save(['pythonTest.mat'],'train_set_x','-v7.3');
The data in MATLAB has size (50,1,51,61,23).
I load the .mat file in Python following the instructions in this link.
The code is as follows:
import numpy as np, h5py
f = h5py.File('pythonTest.mat', 'r')
train_set_x = f.get('train_set_x')
train_set_x = np.array(train_set_x)
The output of train_set_x.shape is (23L, 61L, 51L, 1L, 50L). It is expected to be (50L, 1L, 51L, 61L, 23L). So I changed the shape by
train_set_x=np.transpose(train_set_x, (4,3,2,1,0))
I am curious about the change in data shape between Python and MATLAB. Are there any errors in my code?

You do not have any errors in the code. There is a fundamental difference between Matlab and NumPy in the way they lay out multi-dimensional arrays.
Both Matlab and NumPy store all the elements of a multi-dimensional array as a single contiguous block in memory. The difference is the order of the elements:
Matlab (like Fortran) stores the elements in a column-first fashion, that is, the first dimension varies fastest. For a 2D array, the positions in memory are:
[1 3;
2 4]
In contrast, NumPy (by default, like C) stores the elements in a row-first fashion, that is, the last dimension varies fastest:
[1 2;
3 4]
So a block in memory with size [m,n,k] in Matlab is seen by Python as an array of shape [k,n,m].
For more information see this wiki page.
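BTW, note that setting the order to "Fortran" (col-major as in Matlab) with np.array(train_set_x, order='F') only changes the memory layout, not the reported shape, so the transpose is still needed. Calling np.transpose without an axes argument reverses all the axes, which is a shorter way to write it:
train_set_x = np.transpose(np.array(train_set_x))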
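The axis reversal can be reproduced without MATLAB or h5py. The sketch below only simulates the situation: it fills an array in column-major order (as MATLAB's reshape does), reads the same flat buffer back in row-major order, and checks which operation restores the original shape:

```python
import numpy as np

matlab_shape = (50, 1, 51, 61, 23)
n = int(np.prod(matlab_shape))

# MATLAB's reshape fills column-major; order='F' mimics that here.
matlab_like = np.arange(1, n + 1).reshape(matlab_shape, order='F')

# The flat buffer as stored on disk, read back by a row-major reader:
raw = matlab_like.ravel(order='F')
as_read = raw.reshape(matlab_shape[::-1])   # axes appear reversed

# Reversing the axes restores both the shape and the element positions.
restored = np.transpose(as_read)

# Changing only the memory order leaves the shape untouched.
fortran_copy = np.array(as_read, order='F')

print(as_read.shape, restored.shape, fortran_copy.shape)
```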

Matlab-Python translation error

Matlab Code:
AP(queryIdx) = diff([0;recall]')*prec
My python code:
AP[queryIdx] = np.dot(np.diff(np.concatenate(([[0]], recall), axis=0).transpose()),prec)
Variables (checked, and I am quite sure they are equivalent in Python and in Matlab):
recall: 1000x1 np array*
prec: 1000x1 np array
* prints out as [[.],.....,[.]]
Results:
Matlab: .1011
Python: 0.05263158
The only cause I can think of outside of the code is that Python uses more precision, but I doubt that would make such a large difference.
Edit: There was a problem with my prec variable. The above code worked.

That code looks a bit messy. Try replacing it with this:
AP[queryIdx] = np.dot(np.diff(np.hstack([0, recall.ravel()])), prec.ravel())
In your post, you mentioned that you have a 1000 x 1 array for both recall and prec. I interpret this as a 2D array with a singleton dimension: the second dimension. As such, you'd need to convert this back to a 1D array using ravel.
Now, np.hstack horizontally stacks 1D arrays together, so this will prepend a 0, then apply the diff operator, and then perform the dot product with prec.
One common gotcha that MATLAB coders have with numpy is the representation of 1D arrays. There is no such thing as the transpose of a 1D array: it has a single axis, so transposing it returns the same shape. If you explicitly want to make the 1D array a column vector, you need to add a leading dimension of size 1 and then transpose. Something like this:
r = v[:][None].T
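For comparison, here is a short sketch of that expression next to two other common spellings of the same operation (the vector v is a made-up example):

```python
import numpy as np

v = np.arange(5)            # 1D array, shape (5,)
transposed = v.T            # transposing a 1D array changes nothing

col_a = v[:][None].T        # the expression above: (1, 5) then transposed
col_b = v[:, None]          # add the new axis directly in second position
col_c = v.reshape(-1, 1)    # or reshape to "as many rows as needed, 1 column"

print(transposed.shape, col_a.shape, col_b.shape, col_c.shape)
```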
In any case, let's verify the results:
MATLAB
>> recall = (1:1000).';
>> prec = (1000:-1:1).';
>> diff([0; recall].')*prec
ans =
500500
Python (IPython)
In [1]: import numpy as np
In [2]: recall = np.arange(1,1001)
In [3]: prec = np.arange(1000,0,-1)
In [4]: np.dot(np.diff(np.hstack([0, recall.ravel()])), prec.ravel())
Out[4]: 500500
