Iterate over Excel sheets, clean them up and concatenate - python

The code below iterates through all sheets, changes each one, and concatenates them.
import pandas as pd

sheets_dict = pd.read_excel('Royalties Jan to Dec 21.xlsx', sheet_name=None)
all_sheets = []
for name, sheet in sheets_dict.items():
    sheet['sheet'] = name
    sheet = sheet.fillna('')
    sheet.columns = (sheet.iloc[2] + ' ' + sheet.iloc[3])
    sheet = sheet[sheet.iloc[:, 0] == 'TOTAL']
    all_sheets.append(sheet)

full_table = pd.concat(all_sheets)
full_table.reset_index(inplace=True, drop=True)
full_table.to_excel('output.xlsx')
However, when I execute the code, I get the following error:
InvalidIndexError: Reindexing only valid with uniquely valued Index objects
I have pinpointed the issue to the following line:
sheet.columns = (sheet.iloc[2] + ' ' + sheet.iloc[3])
This line is supposed to merge two rows together:
Would anyone know what I'm doing wrong? Thank you.
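The thread does not include an answer, but this particular error usually means an index contains duplicate labels: if rows 2 and 3 hold blank cells, several columns end up with the same combined name (for example ''), and pandas then refuses to align or reindex them. One possible workaround (a sketch, not from the thread; `dedupe` is a hypothetical helper) is to make the combined labels unique before assigning them:

```python
def dedupe(labels):
    """Append a numeric suffix to repeated labels so the index is unique."""
    seen = {}
    out = []
    for label in labels:
        n = seen.get(label, 0)
        out.append(label if n == 0 else f'{label}.{n}')
        seen[label] = n + 1
    return out

# Combined header labels as the question builds them, with blanks colliding
labels = ['TOTAL', 'Sales', '', '', 'Units']
print(dedupe(labels))  # ['TOTAL', 'Sales', '', '.1', 'Units']
```

In the question's loop this would become `sheet.columns = dedupe(sheet.iloc[2] + ' ' + sheet.iloc[3])`, after which the column labels are unique and pd.concat has something it can reindex.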

Related

List of sheet_names then read under a condition

What I'm trying to do (and I think I'm close) is to read an Excel file, build a list of all the sheet names found in the file, and then, if a name contains "Activity", read that sheet and concatenate it into a data frame. I would appreciate your expertise on this matter.
df_ac = pd.DataFrame()
wb = load_workbook(r"C:\Users\bosuna\OneDrive\Desktop" + '\\' + ffile, read_only=True)
sheets = [s for s in wb if "Activity" in wb]
for sheet in sheets:
    df = pd.read_excel(r"C:\Users\bosuna\OneDrive\Desktop" + '\\' + ffile, sheet_name=sheet, read_only=True)
    df_ac = pd.concat([df, df_ac], ignore_index=True)
You can use append as well, instead of concat (note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so pd.concat is preferred on current versions):
import pandas as pd

x1 = pd.ExcelFile('name of file.xlsx')
empty_df = pd.DataFrame()
for sheet in x1.sheet_names:  # iterate through the sheet names
    if 'ACTIVITY' in sheet:  # check whether 'ACTIVITY' is present in the sheet name
        df = pd.read_excel('name of file.xlsx', sheet_name=sheet)
        empty_df = empty_df.append(df)
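On pandas 2.0 and later, where append is gone, the same accumulation is done by collecting frames in a list and concatenating once. A minimal sketch, with in-memory frames standing in for the parsed sheets:

```python
import pandas as pd

# Stand-ins for the frames read from matching sheets
frames = [pd.DataFrame({'v': [1]}), pd.DataFrame({'v': [2, 3]})]

# One concat at the end replaces the repeated .append() calls
df_all = pd.concat(frames, ignore_index=True)
print(df_all['v'].tolist())  # [1, 2, 3]
```

Concatenating once is also faster than appending in a loop, since each append builds a brand-new DataFrame.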
I'm not entirely sure I understood, but I'm assuming you are trying to load the "Activity" sheet into a pandas DataFrame if it exists within the workbook. If that is the case, I propose the following solution:
wb = load_workbook(r"C:\Users\bosuna\OneDrive\Desktop" + '\\' + ffile, read_only=True)
if 'Activity' in wb.sheetnames:
    df = pd.DataFrame(wb['Activity'].values)
In the list comprehension, you are checking for "Activity" in wb instead of in s (the sheet name).
Since that condition never looks at s, it either keeps every sheet or none of them, depending only on whether the workbook contains a sheet matching "Activity".
From what I understand, the expected behaviour is to read a sheet only if the sheet name itself contains the phrase "Activity", and this is different from the actual behaviour.
Replace the list comprehension with this (iterating wb yields worksheet objects, so use wb.sheetnames to get the names as strings):
sheets = [s for s in wb.sheetnames if "Activity" in s]
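A quick check with plain strings (hypothetical sheet names, standing in for wb.sheetnames) makes the difference visible; the buggy condition never inspects s at all:

```python
names = ['Activity Q1', 'Summary', 'Activity Q2']

buggy = [s for s in names if 'Activity' in names]  # constant test: exact-match lookup in the list
fixed = [s for s in names if 'Activity' in s]      # substring test on each name

print(buggy)  # [] -- no element is exactly 'Activity'
print(fixed)  # ['Activity Q1', 'Activity Q2']
```

With openpyxl the membership test on the workbook checks sheet names rather than list elements, but it is still a constant for every s, which is exactly the bug described above.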

Error with combining multiple workbook and sheets

I wondered if anyone could help. I am combining multiple Excel files that each have between one and three sheets. I want to combine these sheets into three dataframes, and have done so using the code below:
all_workbook1 = pd.DataFrame()
all_workbook2 = pd.DataFrame()
all_workbook3 = pd.DataFrame()
for f in glob.glob("*.xlsx"):
    dfworkbook1 = pd.read_excel(f, sheet_name="sheet1", usecols="B:AO")
    dfworkbook1["Filename"] = "[" + os.path.basename(f) + "]"
    all_workbook1 = all_workbook1.append(dfworkbook1, ignore_index=True)

    dfworkbook2 = pd.read_excel(f, sheet_name="sheet2", usecols="B:AO")
    dfworkbook2["Filename"] = "[" + os.path.basename(f) + "]"
    all_workbook2 = all_workbook2.append(dfworkbook2, ignore_index=True)

    dfworkbook3 = pd.read_excel(f, sheet_name="sheet3", usecols="B:AO")
    dfworkbook3["Filename"] = "[" + os.path.basename(f) + "]"
    all_workbook3 = all_workbook3.append(dfworkbook3, ignore_index=True)
When running this I get the below error:
xlrd.biffh.XLRDError: No sheet named <'sheet3'>
I believe this is because not all of my files have 'sheet3'. What is the best way to avoid this? I have tried adding code at the start that runs over the files and inserts the missing sheets as blank sheets, but have been struggling with this.
Any help would be great.
Thanks,
Dan
Consider using a defined method that runs try/except to account for potentially missing sheets. Then call the method within several list comprehensions, one per sheet, and finally concatenate the corresponding lists of data frames:
def read_xl_data(file, sh):
    try:
        df = (pd.read_excel(file, sheet_name=sh, usecols="B:AO")
                .assign(Filename=f"[{os.path.basename(file)}]"))
    except Exception:  # e.g. the sheet is missing from this workbook
        df = pd.DataFrame()
    return df

# LIST COMPREHENSIONS TO RETRIEVE SPECIFIC SHEETS
sheet1_dfs = [read_xl_data(f, "sheet1") for f in glob.glob("*.xlsx")]
sheet2_dfs = [read_xl_data(f, "sheet2") for f in glob.glob("*.xlsx")]
sheet3_dfs = [read_xl_data(f, "sheet3") for f in glob.glob("*.xlsx")]

# CONCAT CORRESPONDING SHEET DFS TOGETHER
all_workbook1 = pd.concat(sheet1_dfs)
all_workbook2 = pd.concat(sheet2_dfs)
all_workbook3 = pd.concat(sheet3_dfs)
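An alternative sketch (not from the thread): read each workbook once with sheet_name=None, which returns a dict containing only the sheets that actually exist, so no try/except is needed. split_by_sheet is a hypothetical helper, and the in-memory dicts below stand in for pd.read_excel(f, sheet_name=None) results:

```python
import pandas as pd

def split_by_sheet(workbooks, wanted=('sheet1', 'sheet2', 'sheet3')):
    """workbooks: iterable of (filename, {sheet_name: DataFrame}) pairs.
    Missing sheets are simply skipped instead of raising."""
    buckets = {name: [] for name in wanted}
    for fname, sheets in workbooks:
        for name in wanted:
            if name in sheets:
                buckets[name].append(sheets[name].assign(Filename=f'[{fname}]'))
    return {name: (pd.concat(dfs, ignore_index=True) if dfs else pd.DataFrame())
            for name, dfs in buckets.items()}

# Stand-ins for two workbooks; the second one has no "sheet3"
wb_a = {'sheet1': pd.DataFrame({'x': [1]}), 'sheet3': pd.DataFrame({'x': [2]})}
wb_b = {'sheet1': pd.DataFrame({'x': [3]})}
result = split_by_sheet([('a.xlsx', wb_a), ('b.xlsx', wb_b)])
```

With real files the pairs would come from something like [(os.path.basename(f), pd.read_excel(f, sheet_name=None, usecols='B:AO')) for f in glob.glob('*.xlsx')].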

Python Pandas ExcelWriter append to sheet creates a new sheet

I would really appreciate some help.
I'm trying to use a loop to create sheets, and add data to those sheets on every pass. The position of my data is correct; however, pandas' ExcelWriter creates a new sheet instead of appending to the one created the first time the loop runs.
I'm a beginner, and right now function comes before form, so forgive me.
My code:
import pandas as pd

# initial files for dataframes
excel_file = 'output.xlsx'
setup_file = 'setup.xlsx'
# write to excel
output_filename = 'output_final.xlsx'

df = pd.read_excel(excel_file)  # create dataframe of entire sheet
df.columns = (df.columns.str.strip().str.lower().str.replace(' ', '_')
              .str.replace('(', '').str.replace(')', ''))  # clean dataframe titles
df_setup = pd.read_excel(setup_file)
df_setup.columns = (df_setup.columns.str.strip().str.lower().str.replace(' ', '_')
                    .str.replace('(', '').str.replace(')', ''))  # clean dataframe titles

df_2 = pd.merge(df, df_setup)  # merge data with setup to have krymp size for each wire in dataframe
df_2['wirelabel'] = ("'" + df_2['cable'] + "_" + df_2['function_code'] + "-"
                     + df_2['terminal_strip'] + ":" + df_2['terminal'])
# creates column for the wirelabel by joining columns with set delimiters. TODO: delimiters to be by inputs.
df_2.sort_values(by=['switchboard'])  # sort so we get proper order
switchboard_unique = df.switchboard.unique().tolist()  # create variable containing unique switchboards for printing to excel sheets

def createsheets(output_filename, sheetname, row_start, column_start, df_towrite):
    with pd.ExcelWriter(output_filename, engine='openpyxl', mode='a') as writer:
        df_towrite.to_excel(writer, sheet_name=sheetname, columns=['wirelabel'],
                            startrow=row_start, startcol=column_start, index=False, header=False)
        writer.save()
        writer.close()

def sorter():
    for s in switchboard_unique:
        df_3 = df_2.loc[df_2['switchboard'] == s]
        krymp_unique = df_3.krymp.unique().tolist()
        krymp_unique.sort()
        # print(krymp_unique)
        column_start = 0
        row_start = 0
        for k in krymp_unique:
            df_3.loc[df_3['krymp'] == k]
            # print(k)
            # print(s)
            # print(df_3['wirelabel'])
            createsheets(output_filename, s, row_start, column_start, df_3)
            column_start = column_start + 1

sorter()
Current behavior:
If sheetname = "sheet", then my script creates sheet1, sheet2, sheet3, etc.
[screenshot of current output]
Wanted behavior:
Create a sheet for each item in "df_3", and put data into columns according to the position calculated in column_start. The positioning in my code works; it just goes to the wrong sheet.
[screenshot of wanted output]
I hope it's clear what I'm trying to accomplish, and all help is appreciated.
I tried all the example code I have found regarding writing to Excel.
I know my code is not a work of art, but I will update this post with the answer to my own question for the sake of completeness, in case anyone stumbles on this post.
It turns out I misunderstood the capabilities of the append mode in pandas' pd.ExcelWriter. It is not possible to append to an already existing sheet; the sheet gets overwritten even though mode is set to 'a'.
Realizing this, I changed my code to build a dataframe for the entire sheet (df_sheet) and then call the "createsheets" function once per sheet. The first version wrote my data column by column.
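For completeness (not part of the original answer): recent pandas versions added an if_sheet_exists parameter to ExcelWriter ('overlay' since pandas 1.4), so an existing sheet can be written into in place instead of being replaced or duplicated. A minimal sketch with made-up file and sheet names:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2]})
df.to_excel('output_final.xlsx', sheet_name='S1', index=False)  # create the file first

# pandas >= 1.4: 'overlay' writes into the existing sheet instead of
# creating 'S11'/'S12'-style copies or replacing the sheet outright
with pd.ExcelWriter('output_final.xlsx', engine='openpyxl', mode='a',
                    if_sheet_exists='overlay') as writer:
    df.to_excel(writer, sheet_name='S1', startcol=2, index=False)
```

After this runs, the workbook still contains a single sheet S1, with the second copy of the data starting in column C.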
"Final" code:
import pandas as pd

# initial files for dataframes
excel_file = 'output.xlsx'
setup_file = 'setup.xlsx'
# write to excel
output_filename = 'output_final.xlsx'
column_name = 0

df = pd.read_excel(excel_file)  # create dataframe of entire sheet
df.columns = (df.columns.str.strip().str.lower().str.replace(' ', '_')
              .str.replace('(', '').str.replace(')', ''))  # clean dataframe titles
df_setup = pd.read_excel(setup_file)
df_setup.columns = (df_setup.columns.str.strip().str.lower().str.replace(' ', '_')
                    .str.replace('(', '').str.replace(')', ''))  # clean dataframe titles

df_2 = pd.merge(df, df_setup)  # merge data with setup to have krymp size for each wire in dataframe
df_2['wirelabel'] = ("'" + df_2['cable'] + "_" + df_2['function_code'] + "-"
                     + df_2['terminal_strip'] + ":" + df_2['terminal'])
# creates column for the wirelabel by joining columns with set delimiters. TODO: delimiters to be by inputs.
df_2.sort_values(by=['switchboard'])  # sort so we get proper order
switchboard_unique = df.switchboard.unique().tolist()  # create variable containing unique switchboards for printing to excel sheets

def createsheets(output_filename, sheetname, df_towrite):
    with pd.ExcelWriter(output_filename, engine='openpyxl', mode='a') as writer:
        df_towrite.to_excel(writer, sheet_name=sheetname, index=False, header=True)

def to_csv_file(output_filename, df_towrite):
    df_towrite.to_csv(output_filename, mode='w', index=False)

def sorter():
    for s in switchboard_unique:
        df_3 = df_2.loc[df_2['switchboard'] == s]
        krymp_unique = df_3.krymp.unique().tolist()
        krymp_unique.sort()
        column_start = 0
        row_start = 0
        df_sheet = pd.DataFrame([])
        for k in krymp_unique:
            df_5 = df_3.loc[df_3['krymp'] == k]
            df_4 = df_5.filter(['wirelabel'])
            column_name = "krymp " + str(k) + " Tavle: " + str(s)
            df_4 = df_4.rename(columns={"wirelabel": column_name})
            df_4 = df_4.reset_index(drop=True)
            df_sheet = pd.concat([df_sheet, df_4], axis=1)
            column_start = column_start + 1
            row_start = row_start + len(df_5.index) + 1
        createsheets(output_filename, s, df_sheet)
        to_csv_file(s + ".csv", df_sheet)

sorter()
Thank you.

adding data to an existing empty dataframe containing only column names

How can I add data to an existing empty column in a dataframe?
I have an empty dataframe with column names (stock tickers)
I am trying to add data to each stock, basically, populate the dataframe column by column, from left to right based on the header name.
I am pulling the data from another CSV file which looks like this (CSV file name = column name in the dataframe Im trying to populate):
PS: an additional issue may arise due to the length of the data available for each stock, e.g. I may have a list of 10 values for the first stock, 0 for the second, and 25 for the third. I plan to save this to a CSV, so perhaps it won't cause too big of an issue.
I have tried the following idea, but without luck. Any suggestions are welcome.
import pandas as pd
import os

path = 'F:/pathToFiles'
Russell3k_Divs = 'Russel3000-Divs/'
Russell3k_Tickers = 'Russell-3000-Stock-Tickers-List.csv'

df_tickers = pd.read_csv(path + Russell3k_Tickers)
divFls = os.listdir(path + Russell3k_Divs)
for i in divFls:
    df = pd.read_csv(path + Russell3k_Divs + i)
    Div = df['Dividends']
    i = i[0].split('.')
    df_tickers[i] = df_tickers.append(Div)
    print(df_tickers)
    break
import pandas as pd
import os
from tqdm import tqdm

path = 'F:/pathToFiles'
Russell3k_Divs = 'Russel3000-Divs/'
Russell3k_Tickers = 'Russell-3000-Stock-Tickers-List.csv'

df_tickers = pd.DataFrame()
divFls = os.listdir(path + Russell3k_Divs)
for i in tqdm(divFls):
    df = pd.read_csv(path + Russell3k_Divs + i)
    i = i.split('.')[0]
    df[str(i)] = df['Date']
    df_tickers = df_tickers.join(df[str(i)], how='outer')

df_tickers.to_csv('Russell-3000-Stock-Tickers-List1.csv', encoding='utf-8', index=False)
This answer was posted as an edit to the question adding data to an existing empty dataframe containing only column names by the OP Mr.Riply under CC BY-SA 4.0.
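On the ragged-length concern raised in the question (10 values for one stock, 0 for another): pd.concat along axis=1 aligns Series of different lengths and pads the shorter columns with NaN, which is one way to build such a frame column by column. A small sketch with made-up tickers and values:

```python
import pandas as pd

# Made-up dividend lists of different lengths, keyed by ticker
data = {'AAA': [0.1, 0.2, 0.3], 'BBB': [], 'CCC': [0.5]}

# axis=1 lines the Series up side by side; shorter columns are padded with NaN
df = pd.concat({k: pd.Series(v, dtype=float) for k, v in data.items()}, axis=1)
print(df.shape)  # (3, 3)
```

Writing this frame with df.to_csv(...) leaves the padded cells empty, which matches the CSV output the question is after.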

Combining Excel worksheets over multiple loops

I've got a number of Excel workbooks, each with multiple worksheets, that I'd like to combine.
I've set up two sets of loops (one while, one for) to read in rows for each sheet in a given workbook and then do the same for all workbooks.
I tried to do it on a subset of these, and it appears to work until I try to combine the two sets using the pd.concat function. The error given is:
TypeError: first argument must be an iterable of pandas objects, you passed an object of type "DataFrame"
Any idea what I'm doing incorrectly?
import pandas as pd

d = 2013
numberOfSheets = 5
while d < 2015:
    # print(str(d) + ' beginning')
    f = 'H:/MyDocuments/Z Project Work/scriptTest ' + str(d) + '.xlsx'
    for i in range(1, numberOfSheets + 1):
        data = pd.read_excel(f, sheetname='Table ' + str(i), header=None)
        print(i)
        df.append(data)
    print(str(d) + ' complete')
    print(df)
    d += 1

df = pd.concat(df)
print(df)
final = "H:/MyDocuments/Z Project Work/mergedfile.xlsx"
df.to_excel(final)
As the error says, pd.concat() requires an iterable, like a list: pd.concat([df1, df2]) will concatenate df1 and df2 along the default axis of 0, which means df2 is appended to the bottom of df1.
Two issues need fixing:
The for loop refers to df before assigning anything to it.
The variable df is overwritten with each iteration of the for loop.
One workaround is to create an empty list of DataFrames before the loops, then append DataFrames to that list, and finally concatenate all the DataFrames in that list. Something like this:
import pandas as pd

d = 2013
numberOfSheets = 5
dfs = []
while d < 2015:
    # print(str(d) + ' beginning')
    f = 'H:/MyDocuments/Z Project Work/scriptTest ' + str(d) + '.xlsx'
    for i in range(1, numberOfSheets + 1):
        data = pd.read_excel(f, sheetname='Table ' + str(i), header=None)
        print(i)
        dfs.append(data)
    print(str(d) + ' complete')
    d += 1

# ignore_index=True gives the result a default IntegerIndex starting from 0
df_final = pd.concat(dfs, ignore_index=True)
print(df_final)

final_path = "H:/MyDocuments/Z Project Work/mergedfile.xlsx"
df_final.to_excel(final_path)
Since I can't comment, I'll leave this as an answer: you can speed up this code by opening the file once and then parsing the workbook to get each sheet. That should shave a second or two off each iteration, since opening the Excel file takes the longest. Here's some code that might help.
Note: setting sheet_name=None will return ALL the sheets in the workbook as a dict:
dfs = {<sheetname1>: <DataFrame1>, <sheetname2>: <DataFrame2>, etc.}
Here's the code:
xl = pd.ExcelFile(fpath)
dfs = xl.parse(sheetname=None, header=None)
for i, (name, df) in enumerate(dfs.items()):  # dfs is a dict, so iterate its items
    # <do stuff with each sheet, if you want>
    print('Sheet {0} ({1}) looks like:\n{2}'.format(i + 1, name, df))
Thank you, both. I accepted the answer that addressed the specific question, but was able to use the second answer and some additional googling thereafter (e.g. glob) to amend the original code and automate it more fully, independent of the number of workbooks or worksheets.
The final version of the above is now below:
import pandas as pd
import glob
# import numpy as np
# import os, collections, csv
# from os.path import basename

fpath = "H:/MyDocuments/Z Project Work/"
dfs = []
files = glob.glob(fpath + '*.xlsx')
for f in files:
    xl = pd.ExcelFile(f)
    xls = xl.parse(sheetname=None, header=0)
    for i, df in enumerate(xls):
        print(i)
        dfs.append(xls[df])
    print(f + ' complete')

df_final = pd.concat(dfs, ignore_index=True)
final = "H:/MyDocuments/Z Project Work/mergedfile.xlsx"
df_final.to_excel(final)
