How to turn a MultiIndex dataframe into a single-index dataframe? - python

I have the following code, which reads an Excel file and then removes one line at a time from the top of the dataframe until the first header cell is 'ACCIDENT ID'.
import pandas as pd

def read_file(file):
    """
    This function reads the Excel file and chooses the sheet that contains the information we need.
    The sheet is then read and the dataframe is created.
    """
    df = pd.ExcelFile(file)
    # Set the string used to identify the correct sheet name.
    sheet_prefix = 'ITD_'
    # Go through each sheet and keep the one containing ITD_.
    for sheet_name in df.sheet_names:
        if sheet_prefix in sheet_name:
            read_sheet = sheet_name
        else:
            invalid_sheet_name = True
    df_read = pd.read_excel(file, sheet_name=str(read_sheet))
    df_duplicate = df_read.copy()
    # Drop header rows until the first cell contains ACCIDENT ID.
    while df_duplicate.columns.values[0][0] != 'ACCIDENT ID':
        columns_list = [df_duplicate.iloc[0].values]
        df_duplicate.columns = columns_list
        df_duplicate = df_duplicate.iloc[1:]
        df_before = df_duplicate
    df_duplicate = df_duplicate.dropna(how='all', axis='columns')
    df_duplicate = df_duplicate.reset_index(drop=True)
    return df_duplicate
What I don't understand is that when I first read the Excel file, the dataframe has a single index. However, the dataframe returned at the end of the function is now a MultiIndex. I thought reset_index would turn it back into a single-index dataframe. This ends up breaking my functions later on, since the columns are in a different form. Does anyone know how to convert the dataframe back so the column headers are a single index?

Found out that the issue I was having was because I was assigning the row values to the columns as a list of arrays, which meant that a MultiIndex dataframe was being created. To fix this, I needed to make sure the values were flattened and then turned into a plain list. The following code fixes the issue.
column_regex = df_duplicate.columns.values.flatten().tolist()
Assigning this list to the column headers fixed my issue and kept the dataframe as a single-index dataframe.
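For context, a minimal sketch of the fix applied inside the loop (using the same df_duplicate as above; column_regex is the variable name from the snippet):

# Flatten the first row's values into a plain list before assigning them
# as headers, so pandas builds a regular Index rather than a MultiIndex
# (wrapping the array in a list is what created the MultiIndex).
column_regex = df_duplicate.iloc[0].values.flatten().tolist()
df_duplicate.columns = column_regex
df_duplicate = df_duplicate.iloc[1:]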

Related

Can't figure out why pandas.concat is creating an extra column when concatenating two frames

I've been trying to concatenate two sheets while preserving the original indices of both dataframes. However, upon concatenation I can't seem to get the result to output the way I expect or want.
If I use ignore_index=True, the old indices are replaced by an index that spans both sheets' total rows.
If I use ignore_index=False, the indices are preserved, but there is a new empty column preceding them, shifting the rest of the columns over by one.
How can I concatenate my sheets without this excess column?
import os
import pandas as pd
import easygui
path = easygui.fileopenbox("Select a file")
xls = pd.ExcelFile(path)
potential_names = [sheet for sheet in xls.sheet_names if sheet.startswith('Numbers_')]
df = pd.concat(pd.read_excel(xls, sheet_name = potential_names), ignore_index = True)
command_list = ["AA", "dd", "DD"]
warning_list = ["Dd", "dD"]
ingenium_list = ["CC", "BB"]
col_name_type = 'TYPE'
col_name_other = 'OTHER'
col_name_source = 'SOURCE'
df_filtered_command = df[df[col_name_type].isin(command_list)]
df_filtered_warnings = df[df[col_name_type].isin(warning_list)]
df_filtered_other = df[df[col_name_other].isin(ingenium_list)]
with pd.ExcelWriter('(Processed) ' + os.path.basename(path)) as writer:
    #df_filtered_command.to_excel(writer, sheet_name="Command", index=True)
    df_final_command.to_excel(writer, sheet_name="Command", index=True)
    df_filtered_warnings.to_excel(writer, sheet_name="Warnings", index=True)
    df_filtered_other.to_excel(writer, sheet_name="Issues", index=True)
I suspect it's how the concat function is working, but I've not been able to figure out how to fix it.
Any help or direction would be amazing.
Edit: adding an example of my df after running.
I was mistaken before; it seems the first column is empty aside from the sheet names, but I'm still not able to find a way to prevent pandas from creating that first column without remaking the index.
Since you passed in a dictionary (not a list) of data frames by using a list for the sheet_name argument of pandas.read_excel, pandas.concat will preserve the dict keys, as documented for its first argument:
objs: a sequence or mapping of Series or DataFrame objects
If a mapping is passed, the sorted keys will be used as the keys argument, unless it is passed, in which case the values will be selected (see below).
Consequently, the sheet names (i.e., the dict keys) and the index values of each data frame migrate into a MultiIndex. Consider using the names argument to name the index levels:
names: list, default None
Names for the levels in the resulting hierarchical index.
sheets_df = pd.concat(
    pd.read_excel(xls, sheet_name=potential_names),
    names=["sheet_name", "index"]
)
If you want to convert the dual index into new columns, simply run reset_index afterwards:
sheets_df = (
    pd.concat(
        pd.read_excel(xls, sheet_name=potential_names),
        names=["sheet_name", "index"]
    ).reset_index()
)

call dataframe from list of dataframes python

I have a use case where I have an unknown list of dfs that are generated from a groupby. The groupby is contained in a list. Once the groupby is done, a unique df is created for each iteration.
I can dynamically create a list and a dictionary of the dataframe names; however, I cannot figure out how to use the position in the list/dictionary as the name of the dataframe. Data and code below:
Code so far:
import pandas as pd

data_list = [['A','F','B','existing'], ['B','F','W','new'], ['C','M','H','new'], ['D','M','B','existing'], ['E','F','A','existing']]
# Create the pandas DataFrame
data_long = pd.DataFrame(data_list, columns=['PAT_ID', 'sex', 'race_ethnicity', 'existing_new_client'])
groupbyColumns = [['sex','existing_new_client'], ['race_ethnicity','existing_new_client']]

def sex_race_summary(groupbyColumns):
    grouplist = groupbyColumns[grouping].copy()
    # Aggregate unique patient counts for the current grouping
    df = data_long.groupby(groupbyColumns[grouping]).agg(total_unique_patients=('PAT_ID', 'nunique')).reset_index()
    # Create a new column capturing the variable names for aggregation, lowercased
    df['variables'] = df[grouplist[0]].str.lower() + '_' + df['existing_new_client'].str.lower()
    return df
for grouping in range(len(groupbyColumns)):
    exec(f'df_{grouping} = sex_race_summary(groupbyColumns)')

print(df_0)

# create a dictionary
dict_of_dfs = dict()
# a list of the dataframe names
df_names = []
for i in range(len(groupbyColumns)):
    df_names.append('df_' + str(i))

print(df_names)
What I'd like to do next is:
# loop through every dataframe and transpose the dataframe
for i in df_names:
    df_transposed = i.pivot_table(index='sex', columns='existing_new_client', values='total_unique_patients', aggfunc='sum').reset_index()
    print(i)
The index of the list matches the suffix of the dataframe name.
But i is being passed as a string, which throws an error. The reason I need to build it this way rather than hard-coding the dataframes is that I won't know in advance how many df_x will be created.
Thanks for your help!
Update based on R Y A N's comment:
Thank you so much for this! You actually gave me another idea, so I made some tweaks. This is my final code, which produces what I want: a summary and a transposed table for each iteration. However, I am learning that globals() is bad practice and that I should use a dictionary instead. How could I convert this to a dictionary-based process?
# create summary tables for each pair of group-by columns
groupbyColumns = [['sex','existing_new_client'], ['race_ethnicity','existing_new_client']]

for grouping in range(len(groupbyColumns)):
    grouplist = groupbyColumns[grouping].copy()
    prefix = grouplist[0]
    # Aggregate unique patient counts for the current grouping
    df = data_long.groupby(groupbyColumns[grouping]).agg(total_unique_patients=('PAT_ID', 'nunique')).reset_index().fillna(0)
    # Create a new column capturing the variable names for aggregation, lowercased
    df['variables'] = df[grouplist[0]].str.lower() + '_' + df['existing_new_client'].str.lower()
    # Transpose the dataframe from long to wide
    df_transposed = df.pivot_table(index=df.columns[0], columns='existing_new_client', values='total_unique_patients', aggfunc='sum').reset_index().fillna(0)
    # Create new global dataframes named after the prefix
    globals()['%s_summary' % prefix] = df
    globals()['%s_summary_transposed' % prefix] = df_transposed
You can use the globals() function to look up the global variable (here a DataFrame) based on the string value i in the loop. See below:
# loop through every dataframe and transpose the dataframe
for i in df_names:
    # call globals() to return a dictionary of global vars and index on i
    i_df = globals()[i]
    # reference the first column indirectly, since it's 'sex' in df_0 but 'race_ethnicity' in df_1
    df_transposed = i_df.pivot_table(index=i_df.columns[0], columns='existing_new_client', values='total_unique_patients', aggfunc='sum')
    print(df_transposed)
However, with just the globals() lookup added, the .pivot_table() call raises an index error, since the 'sex' column exists only in df_0 and not in df_1. To fix this, you can reference the first column indirectly (i.e. index=i_df.columns[0]) in the .pivot_table() call to handle the different column names of the dfs coming through the loop.
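As for the asker's follow-up about replacing globals() with a dictionary: a sketch of a dictionary-based version of the update above (keeping the same variable names and pandas calls) could key each pair of results by its prefix:

# Dictionary-based alternative: keyed by prefix ('sex', 'race_ethnicity'),
# each entry holds the summary and its transposed counterpart.
summaries = {}
for grouping in range(len(groupbyColumns)):
    grouplist = groupbyColumns[grouping].copy()
    prefix = grouplist[0]
    df = data_long.groupby(groupbyColumns[grouping]).agg(total_unique_patients=('PAT_ID', 'nunique')).reset_index().fillna(0)
    df['variables'] = df[grouplist[0]].str.lower() + '_' + df['existing_new_client'].str.lower()
    df_transposed = df.pivot_table(index=df.columns[0], columns='existing_new_client', values='total_unique_patients', aggfunc='sum').reset_index().fillna(0)
    summaries[prefix] = {'summary': df, 'transposed': df_transposed}

# Access results by name instead of through a generated global variable:
print(summaries['sex']['transposed'])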

CSV unnamed column headers being written. How do I stop that? It adds the row number on the left whenever I run the program, offsetting the index

I am trying to replace a certain cell in a csv but for some reason the code keeps adding this to the csv:
,Unnamed: 0,User ID,Unnamed: 1,Unnamed: 2,Balance
0,0,F7L3-2L3O-8ASV-1CG4,,,5.0
1,1,YP2V-9ERY-6V3H-UG1A,,,4.0
2,2,9FPM-879N-3BKG-ZBX8,,,0.0
3,3,1CY4-47Y8-6317-UQTK,,,5.0
4,4,H9BP-5N77-7S2T-LLMG,,,100.0
It should look like this:
User ID,,,Balance
F7L3-2L3O-8ASV-1CG4,,,5.0
YP2V-9ERY-6V3H-UG1A,,,4.0
9FPM-879N-3BKG-ZBX8,,,0.0
1CY4-47Y8-6317-UQTK,,,5.0
H9BP-5N77-7S2T-LLMG,,,100.0
My code is:
import pandas as pd

equations_reader = pd.read_csv("bank.csv")
equations_reader.to_csv('bank.csv')
add_e_trial = equations_reader.at[bank_indexer_addbalance, 'Balance'] = read_balance_add + coin_amount
In summary, I want to open the CSV file, make a change and save it again without Pandas adding an index and without it modifying empty columns.
Why is it doing this? How do I fix it?
Pandas, as you have seen, will allocate Unnamed: xxx column names to empty column headers. These columns can either be removed or renamed.
When saving, by default pandas will also add a numbered index column; this is optional and can be omitted by passing an index=False parameter.
For example:
import pandas as pd
df = pd.read_csv("bank.csv")
# Rename any unnamed columns
df = df.rename(columns=lambda x: '' if x.startswith('Unnamed') else x)
# Remove any unnamed columns
# df = df.loc[:, ~df.columns.str.contains('^Unnamed')]
# << update cells >>
df.to_csv('bank2.csv', index=False)
This renames any column names that start with Unnamed to an empty string. Combined with index=False, this approach should result in the output CSV containing only your updated cells, with no extra index column.
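As a side note (an assumption about how the stray Unnamed: 0 column appeared in the first place): if bank.csv already contains an index column from an earlier save, you can absorb it on read instead of renaming it away afterwards:

# Treat the first CSV column as the index rather than as data,
# so a previously saved index doesn't surface as 'Unnamed: 0'.
df = pd.read_csv("bank.csv", index_col=0)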

Summary Row for a pd.DataFrame with multiindex

I have a multiIndex dataframe created with pandas similar to this one:
nest = {'A1': dfx[['aa','bb','cc']],
        'B1': dfx[['dd']],
        'C1': dfx[['ee','ff']]}
reform = {(outerKey, innerKey): values for outerKey, innerDict in nest.items() for innerKey, values in innerDict.items()}
dfzx = pd.DataFrame(reform)
What I am trying to achieve is to add a new row at the end of the dataframe that contains the totals for the three categories represented by the new index (A1, B1, C1).
I have tried df.loc (what I would normally use in this case) but I get an error. Similarly for iloc.
a1sum = dfzx['A1'].sum().to_list()
a1sum = sum(a1sum)
b1sum = dfzx['B1'].sum().to_list()
b1sum = sum(b1sum)
c1sum = dfzx['C1'].sum().to_list()
c1sum = sum(c1sum)
totalcat = a1sum, b1sum, c1sum
newrow = ['Total', totalcat]
dfzx.loc[len(dfzx)] = newrow
# ValueError: cannot set a row with mismatched columns

# Alternatively
newrow2 = ['Total', a1sum, b1sum, c1sum]
dfzx.loc[len(dfzx)] = newrow2
# ValueError: cannot set a row with mismatched columns
How can I fix the mistake? Or else is there any other function that would allow me to proceed?
Note: the DF is destined to be moved on an Excel file (I use ExcelWriter).
The type of result I want to achieve in the end is a table with a gray "SUM" row appended at the bottom (screenshot omitted).
I came up with a sort of solution on my own.
I created a separate DataFrame in pandas that contains the summary, and used ExcelWriter to place both dataframes on the same Excel worksheet.
Technically, it would then be possible to style and format the data in Excel (xlsxwriter or framestyle seem to be popular modules for doing so); alternatively, one would do that manually.
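A minimal sketch of that workaround (assuming the dfzx frame from above and the a1sum/b1sum/c1sum totals; the file name and row offset are placeholders):

import pandas as pd

# Hypothetical one-row summary frame holding the per-category totals.
df_sum = pd.DataFrame([[a1sum, b1sum, c1sum]], index=['SUM'], columns=['A1', 'B1', 'C1'])

with pd.ExcelWriter('summary.xlsx') as writer:
    # Write the main frame first, then the summary just below it;
    # the +3 offset roughly accounts for the MultiIndex header rows.
    dfzx.to_excel(writer, sheet_name='Sheet1')
    df_sum.to_excel(writer, sheet_name='Sheet1', startrow=len(dfzx) + 3)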

How do I verify a specific cell value in an Excel sheet through a pandas DataFrame in Python 3.7

I have an Excel sheet with some values; one such cell has Yes/No written in it. Since I have loaded the entire file in Python and made a DataFrame of it, how do I check specific cell values so that only the rows with a No value get printed?
import pandas as pd

data = pd.read_excel(r'/Volumes/SSD/Project/Raw_Data/Light.xlsx')
df1 = pd.DataFrame(data)
df3 = pd.DataFrame(data, columns=['Person in Room'])  # Help needed
Try the following steps:
# Takes in Light.xlsx as dataframe
datacheck = pd.read_excel(r'/Volumes/SSD/Project/Raw_Data/Light.xlsx')
# Prints all rows with NaN values
datacheck[datacheck.isna().any(axis=1)]
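Note that the snippet above surfaces rows containing NaN values rather than rows containing "No". If the Yes/No values live in a specific column (assuming here it is the 'Person in Room' column mentioned in the question), a boolean filter prints only the No rows:

# Keep only the rows where the assumed Yes/No column equals 'No'.
no_rows = datacheck[datacheck['Person in Room'] == 'No']
print(no_rows)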
