I am trying to write the following MATLAB code in Python:
function[x,y,z] = Testfunc(filename, newdata, a, b)
sheetname = 'Test1';
data = xlsread(filename, sheetname);
if data(1) == 1
newdata(1,3) = data(2);
newdata(1,4) = data(3);
newdata(1,5) = data(4);
newdata(1,6) = data(5)
else
....
....
....
It is a very long function, but this is the part where I am stuck and have no clue at all.
This is what I have written so far in Python:
import pandas as pd
def test_func(filepath, newdata, a, b):
data = pd.read_excel(filepath, sheet_name = 'Test1')
if data[0] == 1:
I am stuck here, and I am not even sure whether the 'if' statement is right. I am looking for suggestions and help.
Info: the Excel sheet has 1 row and 13 columns; newdata is also a 2-D matrix.
Try running that code and printing out your dataframe (print(data)). You will see that a DataFrame is different from a MATLAB matrix. read_excel will try to infer your columns, so you will probably end up with no rows and just columns. To stop pandas from treating your first row as column headers, use:
data = pd.read_excel(filepath, sheet_name='Test1', header=None)
Indexing the dataframe with a single value like data[0] selects a column (a Series), not a cell, so your comparison asks whether an entire column equals 1, which cannot be used directly in an if statement. To read a single cell you must give both the row and the column. To achieve what you are doing in MATLAB, use the iloc indexer on your dataframe: data.iloc[0,0]. This accesses row 0, element 0. Your code should look like this:
import pandas as pd
def test_func(filepath, newdata, a, b):
    data = pd.read_excel(filepath, sheet_name='Test1', header=None)
if data.iloc[0,0] == 1:
newdata.iloc[0,2:6] = data.iloc[0,1:5]
....
I suggest you read up on indexing in pandas.
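As a minimal illustration of the difference (using a made-up single-row frame in place of the Excel data), plain [] indexing gives you a column while iloc gives you a cell:
import pandas as pd

# stand-in for the single-row, header=None Excel data
data = pd.DataFrame([[1, 10, 20, 30, 40, 50]])

print(data[0])            # plain [] selects the column labelled 0 (a Series)
print(data.iloc[0, 0])    # row 0, column 0 -> the single value 1
print(data.iloc[0, 1:5])  # row 0, columns 1 to 4, like data(2:5) in MATLAB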
Related
I'm trying to put some sort of length validation for columns using Pandas. For example, let's say I have a csv named test.csv that has the following data within it:
Column1,Column2,Column3
Data1,Data2,DataDataData3
Data1,Data2,Data3
Now, let's say I have a SQL table called [dbo].[Test1] with the following column datatypes and lengths:
CREATE TABLE [dbo].[Test1](Column1 VARCHAR(5),Column2 VARCHAR(5),Column3 VARCHAR(5))
Now, the scenario: I'm trying to use Pandas read_csv to pick up this test.csv and then use to_sql to import this data. The code within Pandas would look similar to this (obviously with more logic to pick up multiple files in a directory):
import pandas as pd
file = r'C:\Users\test\Documents\test.csv'
df = pd.read_csv(file, skip_blank_lines = True, warn_bad_lines = True)
df.to_sql(schema='dbo', name='Test1', con=conn, if_exists='append', index=False)
The conn is my connection string variable, but that's not the issue. When this is run, it throws an error, since the Column3 data in the first row is too long (13 characters) for the length set in SQL for Column3 (5). My question is: is there a way in Pandas to reject that record and import only the records that don't have an issue?
I'm trying to find something on length validation for Pandas to_sql, but I'm coming up at a loss.
Thank you
If you want to remove those rows with strings that have length > 5 before importing to sql, the below should work in between pd.read_csv() and df.to_sql().
df = df[df['Column3'].apply(lambda x: len(x) <= 5)]
Or you could do a quick for loop, like the below:
for col in df.columns.to_list():
    df = df[df[col].apply(lambda x: len(x) <= 5)]
Here's the logic I ended up using, but I can't use it on multiple columns:
for i, row in df.iterrows():
if len(row['Column3']) > 5:
df.drop(index = i, inplace = True)
The only way I found to use it on multiple columns is creating multiple if statements, like so:
for i, row in df.iterrows():
if len(row['Column1']) > 5:
df.drop(index = i, inplace = True)
if len(row['Column2']) > 5:
df.drop(index = i, inplace = True)
if len(row['Column3']) > 5:
df.drop(index = i, inplace = True)
I don't think this is the most efficient way, but it does work. Also, I've not tested to see how much this increases the time it takes to import.
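If it helps, here is a rough sketch of a vectorised alternative that checks every column in one pass (it casts values to strings with astype(str) first, since len() fails on numbers or NaN); I haven't benchmarked it against the iterrows version:
# keep only the rows where every value, viewed as a string, is at most 5 characters
mask = df.astype(str).apply(lambda col: col.str.len() <= 5).all(axis=1)
df = df[mask]
This avoids dropping rows one at a time inside iterrows, which tends to be slow on larger files.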
I'm playing with some data from an Excel file. I imported the file, made it into a dataframe, and now want to iterate over a column named 'Category' for certain keywords, find them, and return another column ('Asin'). I'm having trouble finding the correct syntax to make this work.
The code below is my attempt at an if statement:
import pandas as pd
import numpy as np
file = r'C:/Users/bryanmccormack/Downloads/hasbro_dummy_catalog.xlsx'
xl = pd.ExcelFile(file)
print(xl.sheet_names)
df = xl.parse('asins')
df
check = df.loc[df.Category == 'Action Figures'] = 'Asin'
print(check)
Alex Fish provided the correct answer, if I understand the question.
To elaborate, df.loc[df.Category == 'Action Figures'] returns a data frame with the rows that meet the bracketed condition, so ['Asin'] at the end returns the "Asin" column from that data frame.
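Putting that together, a minimal sketch of the corrected line (brackets instead of a second assignment) would be:
# rows where Category matches, then take the 'Asin' column
check = df.loc[df.Category == 'Action Figures']['Asin']
# or, equivalently, in a single .loc call
check = df.loc[df.Category == 'Action Figures', 'Asin']
print(check)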
Fyi,
check = df.loc[df.Category == 'Action Figures'] = 'Asin'
This is a multiple assignment statement - that is,
a = b = 4
is the same as
b = 4
a = b
So your code is apparently rewriting some values of your data frame df, which you probably don't want.
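To make that concrete, here is a small sketch (with a made-up two-row frame) of what the original line actually does:
import pandas as pd

df = pd.DataFrame({'Category': ['Action Figures', 'Games'],
                   'Asin': ['B001', 'B002']})

check = df.loc[df.Category == 'Action Figures'] = 'Asin'
print(check)  # just the string 'Asin'
print(df)     # every cell in the matching row has been overwritten with 'Asin'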
I would like to import Excel tables (made using the Excel 2007 and above table feature) in a workbook into separate dataframes. Apologies if this has been asked before, but from my searches I couldn't find what I wanted. I know you can easily do this using the read_excel function, however this requires specifying a sheet name or returns a dict of dataframes, one per sheet.
Instead of specifying a sheet name, I was wondering whether there was a way of specifying a table name, or better yet returning a dict of dataframes, one for each table in the workbook.
I know this can be done by combining xlwings with pandas but was wondering whether this was built-into any of the pandas functions already (maybe ExcelFile).
Something like this:-
import pandas as pd
xls = pd.ExcelFile('excel_file_path.xls')
# to read all tables to a map
tables_to_df_map = {}
for table_name in xls.table_names:
    tables_to_df_map[table_name] = xls.parse(table_name)
Although not exactly what I was after, I have found a way to get table names with the caveat that it's restricted to sheet name.
Here's an excerpt from the code that I'm currently using:
import pandas as pd
import openpyxl as op
wb=op.load_workbook(file_location)
# Connecting to the specified worksheet
ws = wb[sheetname]
# Initialising an empty list where the excel tables will be imported into
var_tables = []
# Importing table details from excel: Table_Name and Sheet_Range
for table in ws._tables:
sht_range = ws[table.ref]
data_rows = []
i = 0
j = 0
for row in sht_range:
j += 1
data_cols = []
for cell in row:
i += 1
data_cols.append(cell.value)
if (i == len(row)) & (j == 1):
data_cols.append('Table_Name')
elif i == len(row):
data_cols.append(table.name)
data_rows.append(data_cols)
i = 0
var_tables.append(data_rows)
# Creating an empty list where all the dfs will be appended into
var_df = []
# Appending each table extracted from excel into the list
for tb in var_tables:
df = pd.DataFrame(tb[1:], columns=tb[0])
var_df.append(df)
# Merging all in one big df
df = pd.concat(var_df,axis=1) # This merges on columns
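Building on the same openpyxl attributes used above (table.name and table.ref), a compact sketch of the dict-of-DataFrames idea from the question could look like this, assuming each table's first row holds its headers:
import pandas as pd
import openpyxl as op

wb = op.load_workbook(file_location, data_only=True)  # data_only returns values, not formulas
ws = wb[sheetname]

# build {table_name: DataFrame}, one entry per Excel table on the sheet
tables_to_df = {}
for table in ws._tables:
    rows = [[cell.value for cell in row] for row in ws[table.ref]]
    tables_to_df[table.name] = pd.DataFrame(rows[1:], columns=rows[0])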
I have one Excel sheet with the right format (a certain number of headers with specific names). I have another Excel sheet, and I have to check whether it is in the right format or not (it has to have the same number of headers and the same header names; it doesn't matter if the values below the headers change). How can I solve this? Is NLP or any other method suitable?
If you have to compare two Excel files, you could try something like this (I also add some example Excel files):
def areHeaderExcelEqual(excel1, excel2) :
equals = True
if len(excel1.columns) != len(excel2.columns):
return False
for i in range(len(excel1.columns)):
if excel1.columns[i] != excel2.columns[i] :
equals = False
return equals
And that's an application:
import pandas as pd
#create first example Excel
df_out = pd.DataFrame([('string1',1),('string2',2), ('string3',3)], columns=['Name', 'Value'])
df_out.to_excel('tmp1.xlsx')
#create second example Excel
df_out = pd.DataFrame([('string5',1),('string2',5), ('string2',3)], columns=['Name', 'Value'])
df_out.to_excel('tmp2.xlsx')
# create third example Excel
df_out = pd.DataFrame([('string1',1),('string4',2), ('string3',3)], columns=['MyName', 'MyValue'])
df_out.to_excel('tmp3.xlsx')
excel1 = pd.read_excel('tmp1.xlsx')
excel2 = pd.read_excel('tmp2.xlsx')
excel3 = pd.read_excel('tmp3.xlsx')
print(areHeaderExcelEqual(excel1, excel2))
print(areHeaderExcelEqual(excel1, excel3))
Note: the example Excel files are provided just to show the different outputs. excel1, for example, contains the first example DataFrame created above; the idea is the same for the other files. To get more insight, see How to create dataframes.
Here's your code:
f1 = pd.read_excel('file1.xlsx')
f2 = pd.read_excel('file2.xlsx')
print(areHeaderExcelEqual(f1, f2))
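A shorter alternative (just a sketch, relying on a plain list comparison of the column names) is:
def are_headers_equal(excel1, excel2):
    # True only if both files have the same column names in the same order
    return list(excel1.columns) == list(excel2.columns)

print(are_headers_equal(f1, f2))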
You can use pandas for that comparison.
import pandas as pd
f1 = pd.read_excel('sheet1.xlsx')
f2 = pd.read_excel('sheet2.xlsx')
header_threshold = 5  # the expected number of headers
print(len(f1.columns) == header_threshold)
print(f1.columns) # get the column names as values
Is it possible to load matlab tables in python using scipy.io.loadmat?
What I'm doing:
In Matlab:
tab = table((1:500)')
save('tab.mat', 'tab')
In Python:
import scipy.io
mat = scipy.io.loadmat('m:/tab.mat')
But I cannot access the table tab in Python using mat['tab']
The answer to your question is no. Many MATLAB objects can be loaded in Python; tables, among others, cannot. See Handle Data Returned from MATLAB to Python.
The loadmat function doesn't load MATLAB tables. Instead, a small workaround can be done: the tables can be saved as .csv files, which can then be read using pandas.
In MATLAB
writetable(table_name, file_name)
In Python
import pandas as pd

df = pd.read_csv(file_name)
At the end, the DataFrame df will have the contents of table_name
I've looked into this for a project I'm working on, and as a workaround, you could try the following.
In MATLAB, first convert the #table object into a struct, and retrieve the column names using:
table_struct = struct(table_object);
table_columns = table_struct.varDim.labels;
save table_as_struct table_struct table_columns;
And then you can try the following code in python:
import numpy
import pandas as pd
import scipy.io
# function to load table variable from MAT-file
def loadtablefrommat(matfilename, tablevarname, columnnamesvarname):
"""
read a struct-ified table variable (and column names) from a MAT-file
and return pandas.DataFrame object.
"""
# load file
mat = scipy.io.loadmat(matfilename)
# get table (struct) variable
tvar = mat.get(tablevarname)
data_desc = mat.get(columnnamesvarname)
types = tvar.dtype
fieldnames = types.names
# extract data (from table struct)
data = None
for idx in range(len(fieldnames)):
if fieldnames[idx] == 'data':
data = tvar[0][0][idx]
break;
# get number of columns and rows
numcols = data.shape[1]
numrows = data[0, 0].shape[0]
# and get column headers as a list (array)
data_cols = []
for idx in range(numcols):
data_cols.append(data_desc[0, idx][0])
# create dict out of original table
table_dict = {}
for colidx in range(numcols):
rowvals = []
for rowidx in range(numrows):
rowval = data[0,colidx][rowidx][0]
if type(rowval) == numpy.ndarray and rowval.size > 0:
rowvals.append(rowval[0])
else:
rowvals.append(rowval)
table_dict[data_cols[colidx]] = rowvals
return pd.DataFrame(table_dict)
Based on Jochen's answer, I propose a different variant that does a good job for me.
I wrote a MATLAB script to prepare the m-file automatically (see my GitLab repository with examples).
It does the following:
In Matlab for class table:
It does the same as Jochen's example, but binds the data together, so it is easier to load multiple variables. The names "table" and "columns" are mandatory for the next part.
YourVariableName = struct('table', struct(TableYouWantToLoad), 'columns', {struct(TableYouWantToLoad).varDim.labels})
save('YourFileName', 'YourVariableName')
In Matlab for class dataset:
An alternative, if you have to handle the old dataset type.
YourVariableName = struct('table', struct(DatasetYouWantToLoad), 'columns', {get(DatasetYouWantToLoad,'VarNames')})
save('YourFileName', 'YourVariableName')
In Python:
import scipy.io as sio
mdata = sio.loadmat('YourFileName')
mtable = load_table_from_struct(mdata['YourVariableName'])
with
import pandas as pd
def load_table_from_struct(table_structure) -> pd.DataFrame:
# get prepared data structure
data = table_structure[0, 0]['table']['data']
# get prepared column names
data_cols = [name[0] for name in table_structure[0, 0]['columns'][0]]
# create dict out of original table
table_dict = {}
for colidx in range(len(data_cols)):
table_dict[data_cols[colidx]] = [val[0] for val in data[0, 0][0, colidx]]
return pd.DataFrame(table_dict)
It is independent of how the file is loaded, but it is basically a minimized version of Jochen's code, so please give him kudos for his post.
As others have mentioned, this is currently not possible, because MATLAB has not documented how tables are stored in this file format. People are trying to reverse engineer the file format, but this is a work in progress.
A workaround is to write the table to CSV format and to load that using Python. The entries in the table can be variable length arrays and these will be split across numbered columns. I have written a short function to load both scalars and arrays from this CSV file.
To write the table to CSV in matlab:
writetable(table_name, filename)
To read the CSV file in Python:
import pandas

def load_matlab_csv(filename):
"""Read CSV written by matlab tablewrite into DataFrames
Each entry in the table can be a scalar or a variable length array.
If it is a variable length array, then Matlab generates a set of
columns, long enough to hold the longest array. These columns have
the variable name with an index appended.
This function infers which entries are scalars and which are arrays.
Arrays are grouped together and sorted by their index.
Returns: scalar_df, array_df
scalar_df : DataFrame of scalar values from the table
array_df : DataFrame with MultiIndex on columns
The first level is the array name
The second level is the index within that array
"""
# Read the CSV file
tdf = pandas.read_table(filename, sep=',')
cols = list(tdf.columns)
# Figure out which columns correspond to scalars and which to arrays
scalar_cols = [] # scalar column names
arr_cols = [] # array column names, without index
arrname2idxs = {} # dict of array column name to list of integer indices
arrname2colnames = {} # dict of array column name to list of full names
# Iterate over columns
for col in cols:
        # If the name contains an underscore and ends in digits, it's
        # probably from an array
if col[-1] in '0123456789' and '_' in col:
# Array col
# Infer the array name and index
colsplit = col.split('_')
arr_idx = int(colsplit[-1])
arr_name = '_'.join(colsplit[:-1])
# Store
if arr_name in arrname2idxs:
arrname2idxs[arr_name].append(arr_idx)
arrname2colnames[arr_name].append(col)
else:
arrname2idxs[arr_name] = [arr_idx]
arrname2colnames[arr_name] = [col]
arr_cols.append(arr_name)
else:
# Scalar col
scalar_cols.append(col)
# Extract all scalar columns
scalar_df = tdf[scalar_cols]
# Extract each set of array columns into its own dataframe
array_df_d = {}
for arrname in arr_cols:
adf = tdf[arrname2colnames[arrname]].copy()
adf.columns = arrname2idxs[arrname]
array_df_d[arrname] = adf
# Concatenate array dataframes
array_df = pandas.concat(array_df_d, axis=1)
return scalar_df, array_df
scalar_df, array_df = load_matlab_csv(filename)