I have read a Matlab file containing a large number of arrays into Python as a dataset, storing the resulting dictionary under the variable name mat using the command:
mat = loadmat('Sample Matlab Extract.mat')
Is there a way I can then use Python's write to csv functionality to save this Matlab dictionary variable I read into Python as a comma separated file?
import csv

with open('mycsvfile.csv', 'wb') as f:
    w = csv.writer(f)
    w.writerows(mat.items())
creates a CSV file with one column containing array names within the dictionary and then another column containing the first element of each corresponding array. Is there a way to utilize a command similar to this to obtain all corresponding elements within the arrays inside of the 'mat' dictionary variable?
The function scipy.io.loadmat generates a dictionary looking something like this:
{'__globals__': [],
'__header__': 'MATLAB 5.0 MAT-file, Platform: MACI, Created on: Wed Sep 24 16:11:51 2014',
'__version__': '1.0',
'a': array([[1, 2, 3]], dtype=uint8),
'b': array([[4, 5, 6]], dtype=uint8)}
It sounds like what you want to do is make a .csv file with the keys "a", "b", etc. as the column names and their corresponding arrays as the data associated with each column. If so, I would recommend using pandas to make a nicely formatted dataset that can be exported to a .csv file. First, you need to clean out the metadata entries of your dictionary (all the keys beginning with "__"). Then, you want to turn each value in your dictionary into a pandas.Series object. The dictionary can then be turned into a pandas.DataFrame object, which can be saved as a .csv file. Your code would look like this:
import scipy.io
import pandas as pd
mat = scipy.io.loadmat('matex.mat')
mat = {k:v for k, v in mat.items() if k[0] != '_'}
data = pd.DataFrame({k: pd.Series(v[0]) for k, v in mat.items()}) # compatible for both python 2.x and python 3.x
data.to_csv("example.csv")
Here is another solution for converting any .mat file into .csv files; try it:

import scipy.io
import numpy as np

data = scipy.io.loadmat("file.mat")

for i in data:
    if '__' not in i and 'readme' not in i:
        # write each variable to its own CSV so later variables do not overwrite earlier ones
        np.savetxt(i + ".csv", data[i], delimiter=',')
import scipy.io
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

class MatDataToCSV():

    def __init__(self):
        pass

    def convert_mat_tocsv(self):
        mat = scipy.io.loadmat('wiki.mat')
        instances = mat['wiki'][0][0][0].shape[1]
        columns = ["dob", "photo_taken", "full_path", "gender",
                   "name", "face_location", "face_score", "second_face_score"]
        df = pd.DataFrame(index=range(0, instances), columns=columns)
        for i in mat:
            if i == "wiki":
                current_array = mat[i][0][0]
                for j in range(len(current_array)):
                    df[columns[j]] = pd.DataFrame(current_array[j][0])
        return df
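A minimal usage sketch for the class above, assuming 'wiki.mat' is in the working directory (the output filename is my own choice):

converter = MatDataToCSV()
df = converter.convert_mat_tocsv()
df.to_csv('wiki_data.csv', index=False)  # save the assembled DataFrame as a CSV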
Reading a MAT-file (.mat) with the code below
data = scipy.io.loadmat(files[0])
gives a dictionary of keys and values. The keys '__header__', '__version__' and '__globals__' are default metadata entries which we need to remove.
cols = []
for i in data:
    if '__' not in i:
        cols.append(i)

temp_df = pd.DataFrame(columns=cols)

for i in data:
    if '__' not in i:
        temp_df[i] = (data[i]).ravel()
We remove the unwanted metadata keys with the check "if '__' not in i:", then build a DataFrame from the remaining keys, and finally assign each array (flattened with ravel()) to its respective column.
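To finish the conversion to a file, a small addition (the output filename is my own choice), assuming temp_df was built as above:

temp_df.to_csv('mat_data.csv', index=False)  # one CSV with each MAT variable as a column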
Related
Python 3.9.5/Pandas 1.1.3
I use the following code to create a nested dictionary object from a csv file with headers:
import pandas as pd
import json
import os
csv = "/Users/me/file.csv"
csv_file = pd.read_csv(csv, sep=",", header=0, index_col=False)
csv_file['org'] = csv_file[['location', 'type']].apply(lambda s: s.to_dict(), axis=1)
This creates a nested object called org from the data in the columns called location and type.
Now let's say the type column doesn't even exist in the csv file, and I want to pass a literal string as the type value instead of the values from a column in the csv file. For example, I want to create a nested object called org using the values from the location column as before, but just use the string foo for all values of a key called type. How can I accomplish this?
You could just build it by hand:
csv_file['org'] = csv_file['location'].apply(lambda x: {'location': x,
                                                        'type': 'foo'})
Use ChainMap. This allows you to use multiple columns (columns_to_use), and even override existing ones (if type is among these columns, it will be overridden):
from collections import ChainMap
# .. some code
csv_file['org'] = csv_file[columns_to_use].apply(
    lambda s: ChainMap({'type': 'foo'}, s.to_dict()), axis=1)
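A tiny standalone illustration of the ChainMap precedence (the row values here are hypothetical, not from the question's csv); lookups go through the maps left to right, so the literal {'type': 'foo'} wins over any 'type' coming from the row:

from collections import ChainMap

row = {'location': 'Berlin', 'type': 'from_csv'}  # hypothetical row values
merged = ChainMap({'type': 'foo'}, row)
print(merged['type'])      # 'foo'    -- taken from the first map
print(merged['location'])  # 'Berlin' -- falls through to the row dict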
BTW, without adding constant values it could be done by df.to_dict():
csv_file['org'] = csv_file[['location', 'type']].to_dict('records')
I'm writing a very small Pandas DataFrame to a JSON file. In fact, the DataFrame has only one row with two columns.
To build the dataframe:
import pandas as pd
df = pd.DataFrame.from_dict(dict({'date': '2020-10-05', 'ppm': 411.1}), orient='index').T
print(df)
prints
date ppm
0 2020-10-05 411.1
The desired json output is as follows:
{
"date": "2020-10-05",
"ppm": 411.1
}
but when writing the json with pandas, I can only print it as an array with one element, like so:
[
{
"date":"2020-10-05",
"ppm":411.1
}
]
I've currently hacked my code to convert the Dataframe to a dict, and then use the json module to write the file.
import json
data = df.to_dict(orient='records')
data = data[0] # keep the only element
with open('data.json', 'w') as fp:
json.dump(data, fp, indent=2)
Is there a native way with pandas' .to_json() to keep the only dictionary item if there is only one?
I am currently using .to_json() like this, which incorrectly prints the array with one dictionary item.
df.to_json('data.json', orient='index', indent = 2)
Python 3.8.6
Pandas 1.1.3
If you want to export only one row, use iloc:
print (df.iloc[0].to_dict())
#{'date': '2020-10-05', 'ppm': 411.1}
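If the goal is a JSON file rather than just a dict, a minimal sketch building on that (the filename and indent are my own choices); it writes the single row as a plain object instead of a one-element array:

import json

with open('data.json', 'w') as fp:
    json.dump(df.iloc[0].to_dict(), fp, indent=2)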
The problem:
I have lists of genes expressed in 53 different tissues. Originally, this data was stored in a maximal array of the genes, with 'NaN' where there was no expression. I am trying to create new lists for each tissue that just have the genes expressed, as it was very inefficient to be searching through this array every time I ran my script. I have code that finds the genes for each tissue as required, but I do not know how to store the output.
I was using a pandas DataFrame and then converting to csv. But this does not accept lists of varying length, unless I put each list in as a single item. However, when I save the DataFrame to a csv, it tries to squeeze this very long list (all genes expressed for one tissue) into a single cell. I get an error because the string length exceeds the Excel character-per-cell limit.
Therefore I need a way of either dealing with this limit, or storing my lists in a different way. I would rather just have one file for all the lists.
My code:
import csv
import pandas as pd
import math
import numpy as np
#Import list of tissues:
df = pd.read_csv(r'E-MTAB-5214-query-results.tsv', skiprows=[0,1,2,3], sep='\t')
tissuedict = df.to_dict()
tissuelist = list(tissuedict.keys())[2:]
all_genes = [gene for key, gene in tissuedict['Gene Name'].items()]

data = []

for tissue in tissuelist:
    # Create array to keep track of the protein mRNAs in tissue that are not present in the network
    # initiate with first tissue, protein
    nanInd = [key for key, value in tissuedict[tissue].items() if math.isnan(value)]
    tissueExpression = np.delete(all_genes, nanInd)
    datatis = [tissue, tissueExpression.tolist()]
    print(datatis)
    data.append(datatis)

print(data)
df = pd.DataFrame(data)
df.to_csv(r'tissue_expression_data.csv')
Link to data (either one):
https://github.com/joanna-lada/gene_data/blob/master/E-MTAB-5214-query-results.tsv
https://raw.githubusercontent.com/joanna-lada/gene_data/master/E-MTAB-5214-query-results.tsv
IIUC you need lists of the gene names found in each tissue. This writes these lists as columns into a csv:
import pandas as pd
df = pd.read_csv('E-MTAB-5214-query-results.tsv', skiprows = [0,1,2,3], sep='\t')
df = df.drop(columns='Gene ID').set_index('Gene Name')
res = pd.DataFrame()
for c in df.columns:
    res = pd.concat([res, pd.Series(df[c].dropna().index, name=c)], axis=1)
res.to_csv('E-MTAB-5214-query-results.csv', index=False)
(Writing them as rows would have been easier, but Excel can't import so many columns)
Don't open the csv in Excel directly, but use a blank worksheet and import the csv (Data - External data, From text), otherwise you can't separate them into Excel columns in one run (at least in Excel 2010).
Create your data variable as a dictionary. You can then save the dictionary to a JSON file using json.dump:
import json
data = {}
for tissue in tissuelist:
    nanInd = [key for key, value in tissuedict[tissue].items() if math.isnan(value)]
    tissueExpression = np.delete(all_genes, nanInd)
    data[tissue] = tissueExpression.tolist()

with open('filename.json', 'w') as fp:
    json.dump(data, fp)
I have a csv file which contains four columns and many rows, each representing different data, e.g.
OID DID HODIS BEAR
1 34 67 98
I have already opened and read the csv file, however I am unsure how I can make each column into a key. I believe the following format I have used in the code is best for the task I am creating.
Please see my code below; sorry if the explanation is a bit confusing.
Note that the "Values in column 1" part is what I am stuck on; I am unsure how I can define each column.
for line in file_2:
    the_dict = {}
    OID = line.strip().split(',')
    DID = line.strip().split(',')
    HODIS = line.strip().split(',')
    BEAR = line.strip().split(',')
    the_dict['KeyOID'] = OID
    the_dict['KeyDID'] = DID
    the_dict['KeyHODIS'] = HODIS
    the_dict['KeyBEAR'] = BEAR
    dictionary_list.append(the_dict)

print(dictionary_list)
There is a great Python string method that splits a string on a delimiter, .split(delim), where delim is the delimiter, and returns the pieces as a list.
Based on the code in your screenshot, you can use the following to split on a comma, which I assume is your delimiter because you said that your file is a CSV.
...
for line in file_contents_2:
    the_dict = {}
    values = line.strip().split(',')
    OID = values[0]
    DID = values[1]
    HODIS = values[2]
    BEAR = values[3]
...
Also, in case you ever need to split a string on whitespace, that is the default behaviour of .split() (used when no argument is provided).
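A quick illustration of the two behaviours (the sample strings are my own):

print("1,34,67,98".split(','))  # ['1', '34', '67', '98'] -- split on commas
print("1 34 67 98".split())     # ['1', '34', '67', '98'] -- default: split on whitespace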
Here is how I would do it as a whole:
lod = []
with open(file, 'r') as f:
    l = f.readlines()
    for i in l[1:]:
        lod.append(dict(zip(l[0].rstrip().split(), i.split())))
split() doesn't need an argument here; just use a simple for loop inside the with open block, and there is no need to know the keys in advance.
And if you care about empty dictionaries, do:
lod=list(filter(None,lod))
print(lod)
Output:
[{'OID': '1', 'DID': '34', 'HODIS': '67', 'BEAR': '98'}]
If you want integers:
lod=[{k:int(v) for k,v in i.items()} for i in lod]
print(lod)
Output:
[{'OID': 1, 'DID': 34, 'HODIS': 67, 'BEAR': 98}]
Another way to do it is with a library like Pandas, which is powerful for working with tabular data. It is fast because we avoid explicit loops. In the example below you only need Pandas and the name of the CSV file. I used io just to turn string data into something that mimics a csv file.
import pandas as pd
from io import StringIO
data=StringIO('''
OID,DID,HODIS,BEAR\n
1,34,67,98''') #mimic csv file
df = pd.read_csv(data,sep=',')
print(df.T.to_dict()[0])
At the bottom line, you only need a one-liner that chains commands: read the csv, transpose, and transform to a dictionary:
import pandas as pd
csv_dict = pd.read_csv('mycsv.csv',sep=',').T.to_dict()[0]
Is it possible to load matlab tables in python using scipy.io.loadmat?
What I'm doing:
In Matlab:
tab = table((1:500)')
save('tab.mat', 'tab')
In Python:
import scipy.io
mat = scipy.io.loadmat('m:/tab.mat')
But I cannot access the table tab in Python using mat['tab']
The answer to your question is no. Many Matlab objects can be loaded in Python; tables, among others, cannot. See Handle Data Returned from MATLAB to Python
The loadmat function doesn't load MATLAB tables. Instead, a small workaround can be used: the tables can be saved as .csv files, which can then be read using pandas.
In MATLAB
writetable(table_name, file_name)
In Python
df = pd.read_csv(file_name)
At the end, the DataFrame df will have the contents of table_name
I've looked into this for a project I'm working on, and as a workaround, you could try the following.
In MATLAB, first convert the table object into a struct and retrieve the column names using:
table_struct = struct(table_object);
table_columns = table_struct.varDim.labels;
save table_as_struct table_struct table_columns;
And then you can try the following code in python:
import numpy
import pandas as pd
import scipy.io
# function to load table variable from MAT-file
def loadtablefrommat(matfilename, tablevarname, columnnamesvarname):
    """
    read a struct-ified table variable (and column names) from a MAT-file
    and return pandas.DataFrame object.
    """
    # load file
    mat = scipy.io.loadmat(matfilename)

    # get table (struct) variable
    tvar = mat.get(tablevarname)
    data_desc = mat.get(columnnamesvarname)
    types = tvar.dtype
    fieldnames = types.names

    # extract data (from table struct)
    data = None
    for idx in range(len(fieldnames)):
        if fieldnames[idx] == 'data':
            data = tvar[0][0][idx]
            break

    # get number of columns and rows
    numcols = data.shape[1]
    numrows = data[0, 0].shape[0]

    # and get column headers as a list (array)
    data_cols = []
    for idx in range(numcols):
        data_cols.append(data_desc[0, idx][0])

    # create dict out of original table
    table_dict = {}
    for colidx in range(numcols):
        rowvals = []
        for rowidx in range(numrows):
            rowval = data[0, colidx][rowidx][0]
            if type(rowval) == numpy.ndarray and rowval.size > 0:
                rowvals.append(rowval[0])
            else:
                rowvals.append(rowval)
        table_dict[data_cols[colidx]] = rowvals

    return pd.DataFrame(table_dict)
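A usage sketch, assuming the MATLAB snippet above was run so that table_as_struct.mat contains the variables table_struct and table_columns:

# load the struct-ified table back into a pandas DataFrame
df = loadtablefrommat('table_as_struct.mat', 'table_struct', 'table_columns')
print(df.head())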
Based on Jochen's answer, I propose a different variant that does a good job for me.
I wrote a Matlab script to prepare the MAT-file automatically (see my GitLab repository with examples).
It does the following:
In Matlab for class table:
This does the same as Jochen's example, but bundles the data together, making it easier to load multiple variables. The names "table" and "columns" are mandatory for the next part.
YourVariableName = struct('table', struct(TableYouWantToLoad), 'columns', {struct(TableYouWantToLoad).varDim.labels})
save('YourFileName', 'YourVariableName')
In Matlab for class dataset:
An alternative, if you have to handle the old dataset type.
YourVariableName = struct('table', struct(DatasetYouWantToLoad), 'columns', {get(DatasetYouWantToLoad,'VarNames')})
save('YourFileName', 'YourVariableName')
In Python:
import scipy.io as sio
mdata = sio.loadmat('YourFileName')
mtable = load_table_from_struct(mdata['YourVariableName'])
with
import pandas as pd
def load_table_from_struct(table_structure) -> pd.DataFrame:
    # get prepared data structure
    data = table_structure[0, 0]['table']['data']
    # get prepared column names
    data_cols = [name[0] for name in table_structure[0, 0]['columns'][0]]

    # create dict out of original table
    table_dict = {}
    for colidx in range(len(data_cols)):
        table_dict[data_cols[colidx]] = [val[0] for val in data[0, 0][0, colidx]]

    return pd.DataFrame(table_dict)
It is independent of how you load the file, and is basically a minimized version of Jochen's code, so please give him kudos for his post.
As others have mentioned, this is currently not possible, because Matlab has not documented this file format. People are trying to reverse engineer the file format but this is a work in progress.
A workaround is to write the table to CSV format and to load that using Python. The entries in the table can be variable length arrays and these will be split across numbered columns. I have written a short function to load both scalars and arrays from this CSV file.
To write the table to CSV in matlab:
writetable(table_name, filename)
To read the CSV file in Python:
import pandas

def load_matlab_csv(filename):
    """Read CSV written by matlab tablewrite into DataFrames

    Each entry in the table can be a scalar or a variable length array.
    If it is a variable length array, then Matlab generates a set of
    columns, long enough to hold the longest array. These columns have
    the variable name with an index appended.

    This function infers which entries are scalars and which are arrays.
    Arrays are grouped together and sorted by their index.

    Returns: scalar_df, array_df
        scalar_df : DataFrame of scalar values from the table
        array_df : DataFrame with MultiIndex on columns
            The first level is the array name
            The second level is the index within that array
    """
    # Read the CSV file
    tdf = pandas.read_table(filename, sep=',')
    cols = list(tdf.columns)

    # Figure out which columns correspond to scalars and which to arrays
    scalar_cols = []        # scalar column names
    arr_cols = []           # array column names, without index
    arrname2idxs = {}       # dict of array column name to list of integer indices
    arrname2colnames = {}   # dict of array column name to list of full names

    # Iterate over columns
    for col in cols:
        # If the name ends in "_" plus digits, it's probably from an array
        if col[-1] in '0123456789' and '_' in col:
            # Array col
            # Infer the array name and index
            colsplit = col.split('_')
            arr_idx = int(colsplit[-1])
            arr_name = '_'.join(colsplit[:-1])

            # Store
            if arr_name in arrname2idxs:
                arrname2idxs[arr_name].append(arr_idx)
                arrname2colnames[arr_name].append(col)
            else:
                arrname2idxs[arr_name] = [arr_idx]
                arrname2colnames[arr_name] = [col]
                arr_cols.append(arr_name)
        else:
            # Scalar col
            scalar_cols.append(col)

    # Extract all scalar columns
    scalar_df = tdf[scalar_cols]

    # Extract each set of array columns into its own dataframe
    array_df_d = {}
    for arrname in arr_cols:
        adf = tdf[arrname2colnames[arrname]].copy()
        adf.columns = arrname2idxs[arrname]
        array_df_d[arrname] = adf

    # Concatenate array dataframes
    array_df = pandas.concat(array_df_d, axis=1)

    return scalar_df, array_df


scalar_df, array_df = load_matlab_csv(filename)