matlab data file to pandas DataFrame [duplicate] - python

This question already has answers here:
Read .mat files in Python
(15 answers)
Closed 6 years ago.
Is there a standard way to convert MATLAB .mat (MATLAB formatted data) files to a pandas DataFrame?
I am aware that a workaround is possible using scipy.io, but I am wondering whether there is a more straightforward way to do it.

I found two ways: SciPy or mat4py.
mat4py
Load data from MAT-file
The function loadmat loads all variables stored in the MAT-file into a
simple Python data structure, using only Python’s dict and list
objects. Numeric and cell arrays are converted to row-ordered nested
lists. Arrays are squeezed to eliminate arrays with only one element.
The resulting data structure is composed of simple types that are
compatible with the JSON format.
Example: Load a MAT-file into a Python data structure:
data = loadmat('datafile.mat')
From:
https://pypi.python.org/pypi/mat4py/0.1.0
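Since mat4py returns only dicts and lists, the result can often be passed straight to pandas. A minimal sketch, assuming the MAT-file holds a struct of equal-length fields (the variable name 'measurements' is hypothetical):
from mat4py import loadmat
import pandas as pd
data = loadmat('datafile.mat')
# 'measurements' is a hypothetical struct whose fields become columns
df = pd.DataFrame(data['measurements'])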
Scipy:
Example:
import numpy as np
from scipy.io import loadmat # this is the SciPy module that loads mat-files
import matplotlib.pyplot as plt
from datetime import datetime, date, time
import pandas as pd
mat = loadmat('measured_data.mat') # load mat-file
mdata = mat['measuredData'] # variable in mat file
mdtype = mdata.dtype # dtypes of structures are "unsized objects"
# * SciPy reads in structures as structured NumPy arrays of dtype object
# * The size of the array is the size of the structure array, not the number
# elements in any particular field. The shape defaults to 2-dimensional.
# * For convenience make a dictionary of the data using the names from dtypes
# * Since the structure has only one element, but is 2-D, index it at [0, 0]
ndata = {n: mdata[n][0, 0] for n in mdtype.names}
# Reconstruct the columns of the data table from just the time series
# Use the number of intervals to test if a field is a column or metadata
columns = [n for n, v in ndata.items() if v.size == ndata['numIntervals']]  # .items() for Python 3 (was .iteritems())
# now make a data frame, setting the time stamps as the index
df = pd.DataFrame(np.concatenate([ndata[c] for c in columns], axis=1),
                  index=[datetime(*ts) for ts in ndata['timestamps']],
                  columns=columns)
From:
http://poquitopicante.blogspot.fr/2014/05/loading-matlab-mat-file-into-pandas.html
Finally, you can use PyHogs, which still relies on SciPy underneath:
Reading complex .mat files.
This notebook shows an example of reading a MATLAB .mat file, converting the data into a usable dictionary with loops, and a simple plot of the data.
http://pyhogs.github.io/reading-mat-files.html
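As a minimal sketch of that dictionary-with-loops idea (the keys are whatever variables your file contains; scipy.io.loadmat also adds metadata keys such as '__header__', which are skipped here):
from scipy.io import loadmat
mat = loadmat('datafile.mat')
# Keep only the real variables, skipping SciPy's metadata entries
clean = {k: v for k, v in mat.items() if not k.startswith('__')}
for name, value in clean.items():
    print(name, value.shape)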

Ways to do this:
As you mentioned, SciPy:
import scipy.io as sio
test = sio.loadmat('test.mat')
Using the matlab engine:
import matlab.engine
eng = matlab.engine.start_matlab()
content = eng.load("example.mat",nargout=1)
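Either way you get a dict-like result rather than a DataFrame, so one conversion step remains. A minimal sketch, assuming the MAT-file contains a 2-D numeric variable named 'test' (the name is hypothetical):
import pandas as pd
import scipy.io as sio
mat = sio.loadmat('test.mat')
# 'test' is a hypothetical variable name inside the MAT-file
df = pd.DataFrame(mat['test'])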

Related

Read HDF data into a 3D array and save as a dataframe in Python

I am currently working with the NASA aerosol optical depth data (MCD19A2), which is a NASA satellite level-3 product. I have uploaded the data. I want to save the data as a dataframe including all the longitude and latitude information together with the values. I have successfully converted the 0.47 µm band file into a three-dimensional array. I want to ask how to convert this array into a correct dataframe with X, Y, and value columns.
Below are the codes I have tried:
from osgeo import gdal
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
rds = gdal.Open("MCD19A2.A2006001.h26v04.006.2018036214627.hdf")
names=rds.GetSubDatasets()
names[0][0]
'HDF4_EOS:EOS_GRID:"MCD19A2.A2006001.h26v04.006.2018036214627.hdf":grid1km:Optical_Depth_047'
aod_047 = gdal.Open(names[0][0])
a47=aod_047.ReadAsArray()
a47[1].shape
(1200,1200)
I would like the result to be like:
X (n=1200)    Y (n=1200)    AOD_047
8896067       5559289       0.0123
I know that in R this can be done by
require('gdalUtils')
require('raster')
require('rgdal')
file.name<-"MCD19A2.A2006001.h26v04.006.2018036214627.hdf"
sds <- get_subdatasets(file.name)
gdal_translate(sds[1], dst_dataset = paste0('tmp047', basename(file.name), '.tiff'), b = nband)
r.047 <- raster(paste0('tmp047', basename(file.name), '.tiff'))
df.047 <- raster::as.data.frame(r.047, xy = T)
names(df.047)[3] <- 'AOD_047'
But R really relies on memory here: saving to 'tif' and reading the 'tif' back uses a lot of memory. So I want to do this task in Python. Thanks a lot for your help.
You can use pandas (note that pandas.read_hdf reads HDF5/PyTables stores, so it may not open an HDF4-EOS product directly):
import pandas as pd
df = pd.read_hdf('filename.hdf')
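To get the X/Y/value layout asked for above, a minimal sketch that flattens the (1200, 1200) array already obtained with GDAL (the coordinate arrays here are plain index grids; real map coordinates would come from the dataset's geotransform):
import numpy as np
import pandas as pd
# a47[1] is the (1200, 1200) band from the question
band = a47[1]
ny, nx = band.shape
# Index grids standing in for real map coordinates
xx, yy = np.meshgrid(np.arange(nx), np.arange(ny))
df = pd.DataFrame({'X': xx.ravel(), 'Y': yy.ravel(), 'AOD_047': band.ravel()})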

How to Load .mat Folder and files

I am trying to load a .mat dataset into my dataframe. So far, I am only able to load a single file at a time from the folder TrainingSet1 with
os.chdir('/Users/Ashi/Downloads/TrainingSet2')
data = loadmat('A2001.mat')
I am able to see the data in it, but how am I supposed to load the whole TrainingSet1 folder so that I can view the whole thing?
Also, how could I view the .mat files as images?
Here's my code:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
from fastai.vision import *
from fastai.metrics import error_rate
from mat4py import loadmat
from pylab import*
import matplotlib
import os
os.chdir('/Users/Ashi/Downloads/TrainingSet2')
data = loadmat('A2001.mat')
data
{'ECG': {'sex': 'Male', 'age': 68,
'data': [[0.009784321006571624,
0.006006033870606647,
...this is roughly what the data looks like
imshow('A2001.mat',[])
---------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-52-23bbdf3a7668> in <module>
----> 1 imshow('A2001.mat',[])...A long error is displayed
TypeError: unhashable type: 'list'
Thanks for any help
It is hard to tell from your post what the input format is and what your desired output format is.
I am giving you an example of reading all the .mat files in the folder, and an example of how to show data['data'] as an image.
I hope the example is enough for you to keep advancing on your own.
I created a sample data set 'A2001.mat', 'A2002.mat', 'A2003.mat' using MATLAB.
In case you have a MATLAB installation, I recommend executing the following code to create a sample input (so that the Python sample is reproducible):
ECG.sex = 'Male';
ECG.age = 68;
data = im2double(imread('cameraman.tif')) / 10; % Divide by 10 for simulating range [0, 0.1] instead of [0, 1]
save('A2001.mat', 'ECG', 'data');
ECG.sex = 'Male';
ECG.age = 46;
data = im2double(imread('cell.tif'));
save('A2002.mat', 'ECG', 'data');
ECG.sex = 'Female';
ECG.age = 54;
data = im2double(imread('tire.tif'));
save('A2003.mat', 'ECG', 'data');
The Python code sample does the following:
Get a list of all mat files in the folder using glob.glob('*.mat').
Iterate over the mat files, load the data from each file, and append it to a list.
The result of the loop is a list named alldata, containing the data from all mat files.
Iterate over alldata and show data['data'] as an image
(assuming data['data'] is the matrix you want to show as an image).
Here is the code:
from matplotlib import pyplot as plt
from mat4py import loadmat
import glob
import os
os.chdir('/Users/Ashi/Downloads/TrainingSet2')
# Get a list of .mat files in the current folder
mat_files = glob.glob('*.mat')
# List for storing all the data
alldata = []
# Iterate mat files
for fname in mat_files:
    # Load mat file data into data.
    data = loadmat(fname)
    # Append data to the list
    alldata.append(data)
# Iterate alldata elements, and show images
for data in alldata:
    # Assume the image is stored in a matrix named data in MATLAB.
    # data['data']: access the matrix with the key 'data', because data is a dictionary
    img = data['data']
    # Show data as an image using matplotlib
    plt.imshow(img, cmap='gray')
    plt.show(block=True)  # Show image with "blocking"
Update:
The ECG data is not an image but a list of 12 data samples.
The internal structure of the data (after data = loadmat(fname)) is:
Parent dictionary named data.
data contains a dictionary in data['ECG'].
data['ECG']['data'] is a list of 12 lists.
The following code iterates the mat files and displays the ECG data as a graph:
from matplotlib import pyplot as plt
from mat4py import loadmat
import glob
import os
import numpy as np
os.chdir('/Users/Ashi/Downloads/TrainingSet2')
# Get a list of .mat files in the current folder
mat_files = glob.glob('*.mat')
# List for storing all the data
alldata = []
# Iterate mat files
for fname in mat_files:
    # Load mat file data into data.
    data = loadmat(fname)
    # Append data to the list
    alldata.append(data)
# Iterate alldata elements, and plot the ECG data
for data in alldata:
    # The internal structure of the data is a dictionary within a dictionary.
    ecg = data['ECG']
    data = ecg['data']  # data is a list of lists
    # Convert data to NumPy array
    ecg_data = np.array(data)
    # Plot the data as a graph using matplotlib
    plt.plot(ecg_data.T)
    plt.show(block=True)  # Show plot with "blocking"
Result: one ECG plot is shown per file (A0001.mat, A0002.mat; plot images omitted).
Graph with labels:
# Iterate alldata elements, and show labeled graphs
for data in alldata:
    # The internal structure of the data is a dictionary within a dictionary.
    ecg = data['ECG']
    data = ecg['data']  # data is a list of lists
    # Iterate the data list:
    for i in range(len(data)):
        # Plot each channel, with labels d0, d1, d2...
        plt.plot(data[i], label='d' + str(i))
    plt.legend()  # Add legend
    plt.show(block=True)  # Show plot with "blocking"
Result: one labeled plot per file (plot images omitted).

Can't export a NumPy array to R dataset

I'm having some difficulties trying to convert a NumPy array to an R data set. I have a 2D grayscale image of 2362 x 2163 pixels, and I have to import the gray value of each pixel to do some statistical analysis. First, this is what I imported for the process:
import cv2
import numpy as np
from skimage import img_as_ubyte
from skimage import data
from rpy2.robjects import r
from rpy2.robjects import pandas2ri
from pandas import DataFrame
pandas2ri.activate()
First I import the image using cv2 as a NumPy array, and just in case, I convert the array to values between 0 and 255 (256 gray levels).
xchest = cv2.imread("/home/user/xchest.tif", cv2.IMREAD_GRAYSCALE)
xchestsk = img_as_ubyte(xchest)
The output of:
type(xchestsk)
is:
<class 'numpy.ndarray'>
I visually checked the array and, as expected, it is something like:
[[8, 8, 9, ... 200, 234, 245]...[250, 234, 134, ... 67, 8, 8]]
I need all that pixel information on a simple data set that I can use and analyze on RStudio. I tried with:
xchest_R = DataFrame(xchestsk)
xchest_R = r.data('xchestsk')
r.assign("test", xchest_R)
r("save(test, file='/home/user/xchest.gzip', compress=TRUE)")
But when I load it on R:
> load("/home/user/xchest.gzip")
I just get a value: typing test at the prompt returns "xchestsk",
as if I had just imported a string.
I tried with:
np.save("/home/user/xchest.npy", xchestsk)
But when I try to import it on R with:
> library(RcppCNPy)
> xchest_R <- npyLoad("/home/user/xchest.npy", "integer")
RStudio crashes and I have to restart the session.
Finally, I tried converting the NumPy array to a CSV file:
np.savetxt("/home/eera5607/xchest.csv", xchestsk, delimiter=",")
But when I import it to R:
> xchest_data = read.csv(file="/home/eera5607/xchest.csv", header=FALSE, sep=",")
I can't do simple statistical analysis like:
> mean(xchest_data)
Because I get this warning:
> argument is not numeric or logical: returning NA
I tried converting the data to one variable with 5,000,000+ points:
xchest_list = xchestsk.tolist()
xchest_ov = []
for row in xchest_list:
    xchest_ov += row
Then I converted the xchest_ov list to CSV but I get the same warning in RStudio.
What I need is to import all those values, if possible keeping the matrix structure (though that is not strictly necessary; at worst, importing the pixel values as a regular R data set would do), so that I can apply some statistical analysis. I know I can do some of the analysis directly in Python, but I would like this data in RStudio. I have very little knowledge in these topics; I'm a radiologist, and this is the first time I'm working with R.
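On the CSV route, one hedged suggestion: numpy.savetxt writes floats in scientific notation by default, which is awkward for 8-bit pixel data; an explicit integer format keeps the file unambiguous. A minimal sketch, assuming xchestsk is the uint8 array from above:
import numpy as np
# Write each pixel as a plain integer instead of scientific notation
np.savetxt("/home/eera5607/xchest.csv", xchestsk, fmt='%d', delimiter=",")
The R warning itself comes from calling mean() on a data frame; mean(as.matrix(xchest_data)) averages over the whole table.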

Export synthetic dataset created with sklearn to csv

I need to create a synthetic dataset, because I have to fix a clustering algorithm for my university thesis, and I need a small dataset to test the algorithm with.
I managed to create it with sklearn's make_classification, but the program takes as input a CSV file that contains the features of the dataset.
Does anyone know how I can create a synthetic dataset directly as CSV, or export the one created with sklearn to a CSV file?
You can export a numpy array to a csv file using numpy.savetxt.
This example uses a BytesIO instance as output; you would use a file name instead.
In [1]: import io
In [2]: import numpy as np
In [3]: x = np.random.randn(5, 2)
In [4]: x
Out[4]:
array([[-0.13114465, -0.72491874],
[-0.08375738, -1.23769691],
[-0.5583027 , -0.24086865],
[ 0.04590227, -0.6582806 ],
[-0.21433652, -0.78924272]])
In [5]: buf = io.BytesIO()
In [6]: np.savetxt(buf, x, delimiter=',')
In [7]: print(buf.getvalue().decode())
-1.311446488105691699e-01,-7.249187409818331762e-01
-8.375738326459475358e-02,-1.237696910731503452e+00
-5.583026953882282983e-01,-2.408686450946319058e-01
4.590226685041418758e-02,-6.582805971999975414e-01
-2.143365241670896482e-01,-7.892427231682124233e-01
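Putting the two pieces together for the make_classification case, a minimal sketch (the sizes, the file name, and the choice to append the labels as a last column are all arbitrary):
import numpy as np
from sklearn.datasets import make_classification
# Small synthetic dataset: 100 samples, 4 features, 2 classes
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
# Append the class label as the last column and write to CSV
np.savetxt('synthetic.csv', np.column_stack([X, y]), delimiter=',')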

how to combine numpy ndarray?

I have a MODIS atmospheric product. I used the code below to read the data.
%matplotlib inline
import numpy as np
from pyhdf import SD
import matplotlib.pyplot as plt
files = ['file1.hdf', 'file2.hdf', 'file3.hdf']
for n in files:
    hdf = SD.SD(n)
    lat = (hdf.select('Latitude'))[:]
    lon = (hdf.select('Longitude'))[:]
    sds = hdf.select('Deep_Blue_Aerosol_Optical_Depth_550_Land')
    data = sds.get()
    attributes = sds.attributes()
    scale_factor = attributes['scale_factor']
    data = data * scale_factor
    plt.contourf(lon, lat, data)
The problem is that on some days there are 3 data sets (as in this case; some days have four), so I cannot use hstack or vstack to merge these datasets.
My intention is to get a single array from the three different data arrays.
I have also attached the data files via this link: https://drive.google.com/open?id=0B2rkXkOkG7ExYW9RNERaZU5lam8
Your help will be highly appreciated.
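A hedged sketch of one way to merge them: collect each granule's arrays in lists and concatenate along the first (along-track) axis. This assumes the granules agree in shape along the other axis, which may not hold for every MODIS granule:
import numpy as np
from pyhdf import SD
files = ['file1.hdf', 'file2.hdf', 'file3.hdf']
lats, lons, datas = [], [], []
for n in files:
    hdf = SD.SD(n)
    lats.append(hdf.select('Latitude')[:])
    lons.append(hdf.select('Longitude')[:])
    sds = hdf.select('Deep_Blue_Aerosol_Optical_Depth_550_Land')
    datas.append(sds.get() * sds.attributes()['scale_factor'])
# Stack the granules end to end
lat = np.concatenate(lats, axis=0)
lon = np.concatenate(lons, axis=0)
data = np.concatenate(datas, axis=0)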
