List of sheet_names then read under a condition - python

What I'm trying to do (and I think I'm close) is to read an Excel file, build a list of all sheet names found in the file, and then, for every sheet whose name contains "Activity", read it and concatenate it into a data frame. I would appreciate your expertise on this matter.
df_ac = pd.DataFrame()
wb = load_workbook(r"C:\Users\bosuna\OneDrive\Desktop" + '\\' + ffile, read_only=True)
sheets = [s for s in wb if "Activity" in wb]
for sheet in sheets:
    df = pd.read_excel(r"C:\Users\bosuna\OneDrive\Desktop" + '\\' + ffile, sheet_name=sheet, read_only=True)
    df_ac = pd.concat([df, df_ac], ignore_index=True)

You can use append as well, instead of concat:
import pandas as pd

x1 = pd.ExcelFile('name of file.xlsx')
empty_df = pd.DataFrame()
for sheet in x1.sheet_names:        # iterate through the sheet names
    if 'ACTIVITY' in sheet:         # check whether "ACTIVITY" is present in the sheet name
        df = pd.read_excel('name of file.xlsx', sheet_name=sheet)
        empty_df = empty_df.append(df)
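Note that DataFrame.append was removed in pandas 2.0; on newer versions the same loop can collect the frames in a list and concatenate once at the end (a minimal sketch with the same placeholder file name):
import pandas as pd

xl = pd.ExcelFile('name of file.xlsx')
frames = []
for sheet in xl.sheet_names:
    if 'Activity' in sheet:                      # same name check as above
        frames.append(pd.read_excel(xl, sheet_name=sheet))
# concatenate once at the end instead of appending inside the loop
df_ac = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()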

I'm not entirely sure if I understood, but I'm assuming you are trying to load the "Activity" sheet into a pandas dataframe if it exists within the workbook. If that is the case, I propose the following solution:
wb = load_workbook(r"C:\Users\bosuna\OneDrive\Desktop" + '\\' + ffile, read_only=True)
if 'Activity' in wb.sheetnames:
    df = pd.DataFrame(wb['Activity'].values)
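Note that wb['Activity'].values also yields the header row as data; a small follow-up sketch, assuming the sheet's first row holds the column names:
rows = wb['Activity'].values              # generator of row tuples
header = next(rows)                       # assume the first row holds the column names
df = pd.DataFrame(rows, columns=header)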

In the list comprehension, you are checking for "Activity" in wb instead of in s (the sheet name).
This means the list either contains every sheet (when the workbook happens to have a sheet literally named "Activity") or nothing at all.
From what I understand, the expected behaviour is to read a sheet only if the sheet name itself contains "Activity", which is different from the actual behaviour.
Replace that line with this, iterating over wb.sheetnames so you collect the names (strings), which can then be passed to read_excel:
sheets = [s for s in wb.sheetnames if "Activity" in s]
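Putting the pieces together, a minimal sketch of the corrected loop (the path and ffile variable come from the question; read_only is not a pd.read_excel argument, so it is dropped here):
import pandas as pd
from openpyxl import load_workbook

path = r"C:\Users\bosuna\OneDrive\Desktop" + '\\' + ffile
wb = load_workbook(path, read_only=True)

# sheet_name accepts a list of names and returns a dict of {name: DataFrame}
activity_sheets = [s for s in wb.sheetnames if "Activity" in s]
dfs = pd.read_excel(path, sheet_name=activity_sheets)
df_ac = pd.concat(dfs.values(), ignore_index=True)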

Related

How to get the string/key from a dictionary of data frames in pandas?

I'm sure it's simple, but I can't figure it out.
I've got code that creates a DataFrame from each tab of an xlsx file, then adds the df to a dictionary of data frames with the sheet name as the key (e.g. df_dict['sheet 1'], etc.). The code uses a loop in which the string variable sheet_name is assigned the actual sheet name.
I'm now at a point where I want to tell my code to tidy the df up in specific ways according to which df I'm currently referring to, but I can't find how to refer to the DataFrame's key. My code uses the string variable sheet_name in place of the actual key as it loops through sheets.
E.g. I want to say 'if df_dict[sheet_name] == df_dict['sheet 1'] then do x'.
I think I'm basically getting lost on how to refer to the 'string' or 'key' part of my dataframe in the dictionary in that context.
import os
import pandas as pd
from openpyxl import load_workbook

df_dict = {}
counter = 0
list_to_do = ['sheet1', 'sheet2']
# Defines directory as a path - not just a string
directory = os.fsencode(file_path)
# For each file in the folder
for file in os.listdir(directory):
    # Define the filename
    filename = os.fsdecode(file)
    # If it's a spreadsheet - work on it
    if filename.endswith(".xlsx") or filename.endswith(".xlsm"):
        print("Working on " + filename)
        # Add one to the counter to show how many files are worked through and which one you are on
        counter = counter + 1
        print('Counter = ' + str(counter))
        # Load the workbook from the given path and filename
        wb = load_workbook(filename=file_path + "/" + filename)
        # Unprotect each sheet and save with new name
        for sheet in wb:
            sheet.protection.disable()
            # Get the name of the current worksheet
            sheet_name = sheet.title
            print('Working on ' + sheet_name)
            if sheet_name in list_to_do:
                # Loads the excel file from the path, specifying the sheet
                df_dict[sheet_name] = pd.read_excel(file_path + "/" + filename, sheet_name)
                # Get rid of unwanted rows at the top according to the specific sheet
                # for all the keys in the dictionary of dataframes with rows to remove at the top
                """if df_dict[sheet_name] == 'sheet1': Do clean up stuff specific to sheet1"""
Based on my understanding of your question, this is how I would tackle the problem. I can add error handling etc. should you need it:
sheet_1 = pd.DataFrame({"a": [1, 2, 3], "b": [3, 4, 5]})
sheet_2 = pd.DataFrame({"c": [1, 2, 3], "d": [3, 4, 5]})
dict_sheet = {"s1": sheet_1, "s2": sheet_2}

def callback_1(df):
    print(len(df))

def callback_2(df):
    print(df.shape)

# map each sheet key to the clean-up function that should handle it
dict_callback = {"s1": callback_1, "s2": callback_2}

for key, callback in dict_callback.items():
    if key in dict_sheet.keys():
        df = dict_sheet[key]
        output = callback(df)
You can use sheet_name=None or sheet_name=list_to_do to load your sheets into a dictionary where keys are the sheet names and values are the dataframes:
dfs = pd.read_excel('myfile.xlsx', sheet_name=list_to_do)

if 'sheet1' in dfs:
    # do stuff here
    dfs['sheet1'] = cleaned_df
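Building on that, one way to apply sheet-specific tidying is to loop over that dictionary and dispatch on the key; the clean-up functions below are hypothetical placeholders:
import pandas as pd

list_to_do = ['sheet1', 'sheet2']
dfs = pd.read_excel('myfile.xlsx', sheet_name=list_to_do)   # {sheet name: DataFrame}

def tidy_sheet1(df):
    return df.dropna(how='all')                   # placeholder clean-up: drop fully empty rows

def tidy_sheet2(df):
    return df.iloc[3:].reset_index(drop=True)     # placeholder clean-up: drop unwanted top rows

cleaners = {'sheet1': tidy_sheet1, 'sheet2': tidy_sheet2}
for name, df in dfs.items():
    if name in cleaners:
        dfs[name] = cleaners[name](df)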

Multiple Excel Files as Separate Sheets Using Python

Most of the articles I'm seeing either:
a) combine multiple single-sheet Excel workbooks into one master workbook with just a single sheet, or
b) split a multiple-sheet Excel workbook into individual workbooks.
However, my goal is to grab all the excel files in a specific folder and save them as individual sheets within one new master excel workbook. I'm trying to rename each sheet name as the name of the original file.
import pandas as pd
import glob
import os

file = "C:\\File\\Path\\"
filename = 'Consolidated Files.xlsx'
pth = os.path.dirname(file)
extension = os.path.splitext(file)[1]
files = glob.glob(os.path.join(pth, '*xlsx'))

w = pd.ExcelWriter(file + filename)
for f in files:
    print(f)
    df = pd.read_excel(f, header=None)
    print(df)
    df.to_excel(w, sheet_name=f, index=False)
w.save()
How do I adjust the names for each sheet? Also, if you see any opportunities to clean this up, please let me know.
You cannot name a sheet with special characters, and f here is the full path plus file name. You should use only the file name (without the extension) as the sheet name: use os.path.basename to get the file name, and split to separate the name from the extension.
for f in files:
    print(f)
    df = pd.read_excel(f, header=None)
    print(df)
    # Use basename to get the filename with extension
    # Use split to separate the filename from the extension
    new_sheet_name = os.path.basename(f).split('.')[0]
    df.to_excel(w, sheet_name=new_sheet_name, index=False)
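One extra caveat, not something the answer covers: Excel caps sheet names at 31 characters, so it may be worth truncating the derived name:
# Excel limits sheet names to 31 characters; truncate to stay within the limit
new_sheet_name = os.path.basename(f).split('.')[0][:31]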
I decided to put my solution here as well, just in case it would be useful to anyone.
Thing is, I wanted to be able to recall where each resulting sheet came from. However, source workbooks can (and likely will) often have the same sheet names like "Sheet 1", so I couldn't just use sheet names from the original workbooks. I also could not use source filenames as sheet names since they might be longer than 31 characters, which is the maximum sheet name length allowed by Excel.
Therefore, I ended up assigning incremental numbers to resulting sheet names, while simultaneously inserting a new column named "source" at the start of each sheet and populating it with file name concatenated with sheet name. Hope it might help someone :)
from glob import glob
import pandas as pd
import os

files_input = glob(r'C:\Path\to\folder\*.xlsx')
result_DFs = []

for xlsx_file in files_input:
    file_DFs = pd.read_excel(xlsx_file, sheet_name=None)
    # save every sheet from every file as a dataframe to an array
    for sheet_DF in file_DFs:
        source_name = os.path.basename(xlsx_file) + ":" + sheet_DF
        file_DFs[sheet_DF].insert(0, 'source', source_name)
        result_DFs.append(file_DFs[sheet_DF])

# set_column below is an XlsxWriter worksheet method, so the xlsxwriter engine is used explicitly
with pd.ExcelWriter(r'C:\Path\to\resulting\file.xlsx', engine='xlsxwriter') as writer:
    for df_index in range(len(result_DFs)):
        # write dataframe to file using a simple incremental number as the new sheet name
        result_DFs[df_index].to_excel(writer, sheet_name=str(df_index), index=False)
        # auto-adjust column width (can be omitted if not needed)
        for i, col in enumerate(result_DFs[df_index].columns):
            column_len = max(result_DFs[df_index][col].astype(str).str.len().max(), len(col) + 3)
            _ = writer.sheets[str(df_index)].set_column(i, i, column_len)
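This is not from the original answer: if the openpyxl engine is used instead, set_column is not available; a rough equivalent sets column_dimensions directly (assuming the same writer and result_DFs variables as above):
from openpyxl.utils import get_column_letter

ws = writer.sheets[str(df_index)]
for i, col in enumerate(result_DFs[df_index].columns):
    column_len = max(result_DFs[df_index][col].astype(str).str.len().max(), len(col) + 3)
    ws.column_dimensions[get_column_letter(i + 1)].width = column_len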

Error with combining multiple workbooks and sheets

I wondered if anyone could help. I am combining multiple Excel files that have between 1 and 3 sheets. I want to combine these sheets into 3 dataframes and have done so using the below code:
all_workbook1 = pd.DataFrame()
all_workbook2 = pd.DataFrame()
all_workbook3 = pd.DataFrame()

for f in glob.glob("*.xlsx"):
    dfworkbook1 = pd.read_excel(f, sheet_name="sheet1", usecols="B:AO")
    dfworkbook1["Filename"] = "[" + os.path.basename(f) + "]"
    all_workbook1 = all_workbook1.append(dfworkbook1, ignore_index=True)

    dfworkbook2 = pd.read_excel(f, sheet_name="sheet2", usecols="B:AO")
    dfworkbook2["Filename"] = "[" + os.path.basename(f) + "]"
    all_workbook2 = all_workbook2.append(dfworkbook2, ignore_index=True)

    dfworkbook3 = pd.read_excel(f, sheet_name="sheet3", usecols="B:AO")
    dfworkbook3["Filename"] = "[" + os.path.basename(f) + "]"
    all_workbook3 = all_workbook3.append(dfworkbook3, ignore_index=True)
When running this I get the below error:
xlrd.biffh.XLRDError: No sheet named <'sheet3'>
I believe this is due to the fact that not all of my files have 'sheet3'. What is the best process to avoid this? I have tried to add code to the start that runs over the files and adds the missing sheets as a blank sheet but have been struggling with this.
Any help would be great.
Thanks,
Dan
Consider using a defined method that runs try/except to account for potentially missing sheets. Then call the method within several list comprehensions to build corresponding lists of sheet data frames that are ultimately concatenated together:
def read_xl_data(file, sh):
    try:
        df = (pd.read_excel(file, sheet_name=sh, usecols="B:AO")
                .assign(Filename=f"[{os.path.basename(file)}]"))
    except Exception:
        # sheet is missing from this workbook, so return an empty frame instead
        df = pd.DataFrame()
    return df

# LIST COMPREHENSIONS TO RETRIEVE SPECIFIC SHEETS
sheet1_dfs = [read_xl_data(f, "sheet1") for f in glob.glob("*.xlsx")]
sheet2_dfs = [read_xl_data(f, "sheet2") for f in glob.glob("*.xlsx")]
sheet3_dfs = [read_xl_data(f, "sheet3") for f in glob.glob("*.xlsx")]

# CONCAT CORRESPONDING SHEET DFS TOGETHER
all_workbook1 = pd.concat(sheet1_dfs)
all_workbook2 = pd.concat(sheet2_dfs)
all_workbook3 = pd.concat(sheet3_dfs)
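As an alternative sketch (not from the original answer), the try/except can be avoided by reading every sheet of each file with sheet_name=None and keeping only the ones that exist; the sheet names mirror the question:
import glob
import os
import pandas as pd

collected = {"sheet1": [], "sheet2": [], "sheet3": []}
for f in glob.glob("*.xlsx"):
    sheets = pd.read_excel(f, sheet_name=None, usecols="B:AO")   # {name: DataFrame}
    for name, frames in collected.items():
        if name in sheets:
            frames.append(sheets[name].assign(Filename=f"[{os.path.basename(f)}]"))

all_workbook1 = pd.concat(collected["sheet1"], ignore_index=True)
all_workbook2 = pd.concat(collected["sheet2"], ignore_index=True)
all_workbook3 = pd.concat(collected["sheet3"], ignore_index=True)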

Reading data from excel and rewriting it with a new column PYTHON

I recently managed to create a program that reads data from Excel, edits it, and rewrites it along with new columns, and it works well. The issue is performance: if the Excel file contains 1,000 rows it finishes in less than 2 minutes, but if it contains 10-15k rows it can take 3-4 hours, and the more rows I have the more it slows down exponentially, which doesn't make sense to me.
My code:
Reading from xls excel:
def xls_to_dict(workbook_url):
    workbook_dict = {}
    book = xlrd.open_workbook(workbook_url)
    sheets = book.sheets()
    for sheet in sheets:
        workbook_dict[sheet.name] = {}
        columns = sheet.row_values(0)
        rows = []
        for row_index in range(1, sheet.nrows):
            row = sheet.row_values(row_index)
            rows.append(row)
        return rows  # note: this returns after the first sheet, so workbook_dict below is never reached
    return workbook_dict
data = xls_to_dict(filename)
Writing in the excel:
rb = open_workbook(filename, formatting_info=True)
r_sheet = rb.sheet_by_index(0)
wb = copy(rb)
w_sheet = wb.get_sheet(0)
I read about a package called pandas that reads xlsx and tried working with it, but I failed to access the data from the DataFrame as a dictionary, so I couldn't edit it and rewrite it to compare the performance.
My code:
fee = pd.read_excel(filename)
My input raw data file has the columns:
ID. NAME. FAMILY. DOB Country Description
My output file has the columns:
ID. NAME. FAMILY. DOB Country ModifiedDescription NATIONALITY
Any advice will be appreciated.
You can avoid iterating over rows by converting the sheet data to a dataframe and getting the values as a list.
import pandas as pd
from openpyxl import load_workbook
from datetime import datetime, timedelta
from dateutil.relativedelta import relativedelta

def xls_to_dict(workbook_url):
    xl = pd.ExcelFile(workbook_url)
    workbook_dict = {}
    for sheet in xl.sheet_names:
        df = pd.read_excel(xl, sheet)
        columns = df.columns
        rows = df.values.tolist()
        workbook_dict[sheet] = rows
    return workbook_dict, columns

data, columns = xls_to_dict(filename)
For saving, you can also remove the for loop by using a dataframe:
xl = pd.ExcelFile(filename)
sheet_name = xl.sheet_names[0]  # sheet by index
df = pd.read_excel(xl, sheet_name)

df["DOB"] = pd.to_datetime(df["DOB"])
df["age"] = df["DOB"].apply(lambda x: abs(relativedelta(datetime.today(), x).years))
# df["nationality"] = ...  # logic to calculate nationality goes here

# the two assignments below rely on older pandas behaviour (see the note after this block)
book = load_workbook(filename)
writer = pd.ExcelWriter(filename, engine='openpyxl')
writer.book = book
writer.sheets = dict((ws.title, ws) for ws in book.worksheets)
df.to_excel(writer, sheet_name)
writer.save()
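A side note not part of the original answer: in recent pandas versions assigning to writer.book no longer works (the attribute became read-only), and the append mode added in pandas 1.3 covers this use case; a rough equivalent sketch, reusing filename, df and sheet_name from above:
import pandas as pd

# open the existing workbook in append mode and replace the target sheet in place
with pd.ExcelWriter(filename, engine='openpyxl', mode='a', if_sheet_exists='replace') as writer:
    df.to_excel(writer, sheet_name=sheet_name)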

Adding sheet name to the concatenated final merged sheet in excel

I want to merge multiple Excel sheets into one and have a new column with the name of the original sheet.
I'm using the following code:
list_of_sheets = list(df.keys())
cdf = pd.concat(df[sheet] for sheet in list_of_sheets)

# tried
cdf = pd.concat(df[sheet]["Brand"] for sheet in list_of_sheets)

# and
list_of_sheets = list(df.keys())
for sheet in list_of_sheets:
    df[sheet]["Brand"] = sheet
    cdf = pd.concat(df[sheet])
but none of them works
Does this accomplish what you want?
import pandas as pd
pd.concat(pd.read_excel("my_excel_file.xlsx", sheet_name=None))
The sheet names will form the outer level of the resulting dataframe's index.
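If the sheet name is wanted as a regular column (like the "Brand" column in the question) rather than an index level, a small follow-up sketch along the same lines:
import pandas as pd

# sheet_name=None returns {sheet name: DataFrame}
dfs = pd.read_excel("my_excel_file.xlsx", sheet_name=None)
# add the sheet name as a "Brand" column before concatenating
cdf = pd.concat((df.assign(Brand=name) for name, df in dfs.items()), ignore_index=True)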
First read the file:
xl = pd.ExcelFile(file)
Which should produce the following:
<pandas.io.excel.ExcelFile at 0x12cad0860>
Then iterate over the sheets, append the sheet name as a separate column and store all dfs in a list:
dfs = []
for sheet in xl.sheet_names:
    df = xl.parse(sheet)
    df['sheet_name'] = sheet
    dfs.append(df)
Finally, concatenate them:
pd.concat(dfs)
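The same idea can be written a bit more compactly with a list comprehension and DataFrame.assign (just a condensed form of the loop above):
import pandas as pd

xl = pd.ExcelFile(file)
cdf = pd.concat(
    [xl.parse(sheet).assign(sheet_name=sheet) for sheet in xl.sheet_names],
    ignore_index=True,
)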
