Read .mat file in Python, but the shape of the data changed

% save a .mat file in MATLAB
train_set_x=1:50*1*51*61*23;
train_set_x=reshape(train_set_x,[50,1,51,61,23]);
save(['pythonTest.mat'],'train_set_x','-v7.3');
The data in MATLAB has size (50,1,51,61,23).
I load the .mat file in Python following the instructions from this link.
The code is as follows:
import numpy as np, h5py
f = h5py.File('pythonTest.mat', 'r')
train_set_x = f.get('train_set_x')
train_set_x = np.array(train_set_x)
The output of train_set_x.shape is (23L, 61L, 51L, 1L, 50L), but I expected (50L, 1L, 51L, 61L, 23L). So I changed the shape with
train_set_x=np.transpose(train_set_x, (4,3,2,1,0))
I am curious about the change in data shape between Python and MATLAB. Are there any errors in my code?

You do not have any errors in your code. There is a fundamental difference between MATLAB and Python in the way they treat multi-dimensional arrays.
Both MATLAB and Python store all the elements of a multi-dimensional array as a single contiguous block in memory. The difference is the order of the elements:
MATLAB (like Fortran) stores the elements in column-major order, filling the first dimension fastest, so the flat sequence 1, 2, 3, 4 is seen as the 2D array:
[1 3;
2 4]
In contrast, Python (NumPy, by default) stores the elements in row-major order, filling the last dimension fastest, so the same sequence is seen as:
[1 2;
3 4]
So a block in memory with size [m,n,k] in MATLAB is seen by Python as an array of shape [k,n,m].
For more information see this wiki page.
BTW, you can also ask NumPy for Fortran order (column-major, as in MATLAB) when creating the array; note that this only changes the memory layout, not the reported shape, so the transpose above is still needed to recover MATLAB's dimension order:
train_set_x = np.array(train_set_x, order='F')
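As a minimal sketch of both points (assuming the pythonTest.mat file saved above), the order difference and the transpose fix can be checked like this:
import numpy as np
import h5py
# the same six numbers interpreted in column-major (MATLAB) vs row-major (NumPy) order
flat = np.arange(1, 7)
print(flat.reshape((2, 3), order='F'))  # [[1 3 5] [2 4 6]], MATLAB's view
print(flat.reshape((2, 3), order='C'))  # [[1 2 3] [4 5 6]], NumPy's default view
# the saved .mat file: h5py reports the dimensions reversed
with h5py.File('pythonTest.mat', 'r') as f:
    train_set_x = np.array(f['train_set_x'])
print(train_set_x.shape)  # (23, 61, 51, 1, 50)
print(np.transpose(train_set_x, (4, 3, 2, 1, 0)).shape)  # (50, 1, 51, 61, 23)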

Related

Combining n dimensional arrays

I am in the process of converting some MATLAB code to Python. I am working with a 3D volume h x w x d represented as a NumPy array, and I am extracting smaller 3D patches from this volume using the function from SO here. So if I have a 32x32x32 array and extract 16x16x16 patches, I end up with an array of shape (2, 2, 2, 16, 16, 16). After processing each patch I would like to put it back into shape h x w x d, basically reversing window_nd. What would be the idiomatic NumPy way, without looping over each dimension? Since I also need to work with 2D and 4D data, I would like to avoid creating a function for each dimension.
Normally, writing back to as_strided views is not advised because overlapping windows share memory, but since you only made non-overlapping blocks, this should work:
original_shaped_array = windowed_array.transpose(0,3,1,4,2,5).reshape(32,32,32)
Additionally, if you never copied the windowed array and do calculations in-place, the data should be changed in the original array as well: a windowed view is simply a new view into the same data. Don't do this if there is any overlap between windows.
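As a rough check of the axis order, here is a sketch that emulates the non-overlapping blocks with plain reshapes instead of window_nd (an assumption about how the blocks are laid out) and then reverses the windowing as above:
import numpy as np
vol = np.arange(32 * 32 * 32).reshape(32, 32, 32)
# emulate non-overlapping 16x16x16 blocks -> shape (2, 2, 2, 16, 16, 16)
blocks = vol.reshape(2, 16, 2, 16, 2, 16).transpose(0, 2, 4, 1, 3, 5)
# reverse the windowing as in the answer above
restored = blocks.transpose(0, 3, 1, 4, 2, 5).reshape(32, 32, 32)
print(np.array_equal(restored, vol))  # True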

What is the fastest way to read in an image to an array of tuples?

I am trying to assign provinces to an area for use in a game mod. I have two separate maps for area and provinces.
provinces file,
area file.
Currently I am reading in an image in Python and storing it in an array using PIL like this:
from PIL import Image
import numpy as np
land_prov_pic = Image.open(INPUT_FILES_DIR + land_prov_str)
land_prov_array = np.array(land_prov_pic)
image_size = land_prov_pic.size
for x in range(image_size[0]):
    if x % 100 == 0:
        print(x)
    for y in range(image_size[1]):
        land_prov_array[x][y] = land_prov_pic.getpixel((x,y))
Where you end up with land_prov_array[x][y] = (R,G,B)
However, this gets really slow, especially for large images. I tried reading it in using OpenCV like this:
import cv2
land_prov_array = cv2.imread(INPUT_FILES_DIR + land_prov_str)
land_prov_array = cv2.cvtColor(land_prov_array, cv2.COLOR_BGR2RGB) #Convert from BGR to RGB
But now land_prov_array[x][y] = [R G B] which is an ndarray and can't be inserted into a set. But it's way faster than the previous for loop. How do I convert [R G B] to (R,G,B) for every element in the array without for loops or, better yet, read it in that way?
EDIT: Added pictures, more description, and code blocks for readability.
It is best to convert the [R,G,B] array to tuple when you need it to be a tuple, rather than converting the whole image to this form. An array of tuples takes up a lot more memory, and will be a lot slower to process, than a numeric array.
The answer by isCzech shows how to create a NumPy view over a 3D array that presents the data as if it were a 2D array of tuples. This might not require the additional memory of an actual array of tuples, but it is still a lot slower to process.
Most importantly, most NumPy functions (such as np.mean) and operators (such as +) cannot be applied to such an array. Thus, one is obliged to iterate over the array in Python code (or with np.vectorize), which is a lot less efficient than using NumPy functions and operators that work on the array as a whole.
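For instance (a sketch assuming an H x W x 3 uint8 image, as returned by cv2.imread), tuples can be produced only at the point where they are actually needed, e.g. for set membership:
import numpy as np
img = np.zeros((4, 4, 3), dtype=np.uint8)  # stand-in for the cv2.imread result
img[1, 2] = (10, 20, 30)
# build a set of colours without converting the whole image to an array of tuples
colours = set(map(tuple, img.reshape(-1, 3)))
print((10, 20, 30) in colours)  # True
# convert a single pixel only where a tuple is actually required
pixel = tuple(img[1, 2])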
For transformation from a 3D array (data3D) to a 2D array (data2D), I've used this approach:
import numpy as np
dt = np.dtype([('x', 'u1'), ('y', 'u1'), ('z', 'u1')])
data2D = data3D.view(dtype=dt).squeeze()
The .view call changes the data type and still returns a 3D array, now with a last dimension of size 1, which can then be removed with .squeeze. Alternatively you can use .squeeze(axis=-1) to only squeeze the last dimension (in case some of your other dimensions are of size 1 too).
Please note I've used uint8 ('u1') - your type may be different.
Trying to do this using a loop is very slow, indeed (compared to this approach at least).
Similar question here: Show a 2d numpy array where contents are tuples as an image

Assigning values to list slices of large dense square matrices (Python)

I'm dealing with large dense square matrices of size NxN ~(100k x 100k) that are too large to fit into memory.
After doing some research, I've found that most people handle large matrices by using either numpy's memmap or the pytables package. However, these packages seem to have major limitations: neither of them seems to support assigning values to list slices of the on-disk matrix along more than one dimension.
I would like to look for an efficient way to assign values to a large dense square matrix M with something like:
M[0, [1,2,3], [8,15,30]] = np.zeros((3, 3)) # or
M[0, [1,2,3,1,2,3,1,2,3], [8,8,8,15,15,15,30,30,30]] = 0 # for memmap
With memmap, the expression M[0, [1,2,3], [8,15,30]] would always copy the slice into RAM hence assignment doesn't seem to work.
With pytables, list slicing along more than one dimension is not supported. Currently I'm just slicing along one dimension followed by the other dimension (i.e. M[0, [1,2,3]][:, [8,15,30]]). RAM usage of this solution would scale with N, which is better than dealing with the whole array (N^2) but is still not ideal.
In addition, it appears that pytables isn't the most efficient way of handling matrices with lots of rows (or could there be a way of specifying the chunksize to get rid of this message?). I am getting the following warning message:
The Leaf ``/M`` is exceeding the maximum recommended rowsize (104857600 bytes);
be ready to see PyTables asking for *lots* of memory and possibly slow
I/O. You may want to reduce the rowsize by trimming the value of
dimensions that are orthogonal (and preferably close) to the *main*
dimension of this leave. Alternatively, in case you have specified a
very small/large chunksize, you may want to increase/decrease it.
I'm just wondering whether there are better solutions for assigning values to arbitrary 2D slices of large matrices?
First of all, note that in numpy (not sure about pytables) M[0, [1,2,3], [8,15,30]] will return an array of shape (3,) corresponding to elements M[0,1,8], M[0,2,15] and M[0,3,30], so assigning np.zeros((3,3)) to that will raise an error.
Now, the following works fine with me:
np.save('M.npy', np.random.randn(5,5,5)) # create some dummy matrix
M = np.load('M.npy', mmap_mode='r+') # load such matrix as a memmap
M[[0,1,2],[1,2,3],[2,3,4]] = 0
M.flush() # make sure thing is updated on disk
del M
M = np.load('M.npy', mmap_mode='r+') # re-load matrix
print(M[[0,1,2],[1,2,3],[2,3,4]]) # should show array([0., 0., 0.])
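As a follow-up sketch (reusing the same M.npy memmap from above), np.ix_ builds the open-mesh index needed to assign a whole block at the cross product of the index lists, rather than three individual elements:
import numpy as np
M = np.load('M.npy', mmap_mode='r+')
# assign a (1, 3, 3) block at the cross product of the index lists
M[np.ix_([0], [1, 2, 3], [2, 3, 4])] = np.zeros((1, 3, 3))
M.flush()
print(M[np.ix_([0], [1, 2, 3], [2, 3, 4])])  # a (1, 3, 3) block of zeros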

scipy.io.loadmat reads MATLAB (R2016a) structs incorrectly

Instead of loading a MATLAB struct as a dict (as described in http://docs.scipy.org/doc/scipy/reference/tutorial/io.html and other related questions), scipy.io.loadmat is loading it as a strange ndarray, where the values are an array of arrays, and the field names are taken to be the dtype. Minimal example:
(MATLAB):
>> a = struct('b',0)
a =
b: 0
>> save('simple_struct.mat','a')
(Python):
In[1]:
import scipy.io as sio
matfile = sio.loadmat('simple_struct.mat')
a = matfile['a']
a
Out[1]:
array([[([[0]],)]],
dtype=[('b', 'O')])
This problem persists in Python 2 and 3.
This is expected behavior. NumPy is just showing you how MATLAB is storing your data under the hood.
MATLAB structs are 2+D cell arrays where one dimension is mapped to a sequence of strings. In NumPy, this same data structure is called a "record array", and the dtype is used to store the field names. And since MATLAB matrices must be at least 2D, the 0 you stored in MATLAB is really a 2D matrix with dimensions (1, 1).
So what you are seeing from scipy.io.loadmat is how MATLAB stores your data (minus the dtype bit; MATLAB doesn't have such a thing). Specifically, it is a (1, 1) 2D array where one dimension is mapped to the field name 'b', containing another (1, 1) 2D array. MATLAB hides some of these details from you, but NumPy doesn't.
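If you want something closer to MATLAB-style access, here is a sketch using loadmat's own options (squeeze_me and struct_as_record are existing keyword arguments of scipy.io.loadmat):
import scipy.io as sio
# default behaviour: index through the (1, 1) wrappers explicitly
matfile = sio.loadmat('simple_struct.mat')
print(matfile['a']['b'][0, 0])  # the inner (1, 1) array holding 0
# squeeze_me drops the singleton dimensions; struct_as_record=False turns the
# struct into an object whose fields are accessed as attributes
matfile = sio.loadmat('simple_struct.mat', squeeze_me=True, struct_as_record=False)
print(matfile['a'].b)  # 0.0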

Loading Analyze 7.5 format images in python

I'm doing some work whereby I have to load and manipulate CT images in a format called the Analyze 7.5 file format.
Part of this manipulation - which takes absolutely ages with large images - is loading the raw binary data to a numpy array and reshaping it to the correct dimensions. Here is an example:
headshape = (512,512,245) # The shape the image should be
headdata = np.fromfile("Analyze_CT_Head.img", dtype=np.int16) # loads the image as a flat array, 64225280 long. For testing, a large array of random numbers would do
head_shaped = np.zeros(shape=headshape) # Array to hold the reshaped data
# This set of loops is the problem
for ux in range(0, headshape[0]):
    for uy in range(0, headshape[1]):
        for uz in range(0, headshape[2]):
            head_shaped[ux][uy][uz] = headdata[ux + headshape[0]*uy + (headshape[0]*headshape[1])*uz] # Note the weird indexing of the flat array - this is the pixel ordering I have to work with
I know numpy can do reshaping of arrays quickly, but I can't figure out the correct combination of transformations needed to replicate the effect of the nested loops.
Is there a way to replicate that strange indexing with some combination of numpy.reshape/numpy.ravel etc?
Take a look at nibabel, a Python library that implements readers/writers for the Analyze format. It may have already solved this for you.
You could use reshape in combination with swapaxes:
import numpy as np
headshape = (2,3,4)
headdata = np.random.rand(2*3*4)
head_shaped_short = headdata.reshape(headshape[::-1]).swapaxes(0,2)
This worked fine in my case.
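Since the flat index in the loop is ux + headshape[0]*uy + headshape[0]*headshape[1]*uz, the data is effectively stored in Fortran (column-major) order, so the same result should also be obtainable in one step with order='F'. A small sketch checking the equivalence:
import numpy as np
headshape = (2, 3, 4)
headdata = np.random.rand(2 * 3 * 4)
via_swapaxes = headdata.reshape(headshape[::-1]).swapaxes(0, 2)
via_fortran = headdata.reshape(headshape, order='F')
print(np.array_equal(via_swapaxes, via_fortran))  # True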
NumPy stores arrays as flat blocks of memory. The strides attribute contains the information needed to map multidimensional indices to flat indices in memory.
Here is some further reading about NumPy's memory layout.
This should work for you:
# get the number of bytes of the specified dtype
dtype = headdata.dtype
byte_count = dtype.itemsize
headdata = headdata.reshape(headshape)
x, y, z = headshape
headdata.strides = (byte_count, byte_count * x, byte_count * x * y)
# copy data to get back to standard memory layout
data = headdata.copy()
The code sets the strides attribute to reflect your custom memory mapping and thereby creates the (hopefully) correct multidimensional array. After that, it copies the whole array into data in order to get back to a standard memory layout.
