Multiple Excel Files as Separate Sheets Using Python

Most of the articles I'm seeing either:
a) Combine multiple excel single-sheet workbooks into one master workbook with just a single sheet, or
b) Split a multiple-sheet excel workbook into individual workbooks.
However, my goal is to grab all the excel files in a specific folder and save them as individual sheets within one new master excel workbook. I'm trying to rename each sheet name as the name of the original file.
import pandas as pd
import glob
import os
file = "C:\\File\\Path\\"
filename = 'Consolidated Files.xlsx'
pth = os.path.dirname(file)
extension = os.path.splitext(file)[1]
files = glob.glob(os.path.join(pth, '*xlsx'))
w = pd.ExcelWriter(file + filename)
for f in files:
    print(f)
    df = pd.read_excel(f, header = None)
    print(df)
    df.to_excel(w, sheet_name = f, index = False)
w.save()
How do I adjust the names for each sheet? Also, if you see any opportunities to clean this up please let me know

You cannot use f as the sheet name because it is the full path plus file name, which contains characters that are not allowed in sheet names. You should use only the file name for the sheet name: use os.path.basename to get the file name, and use split to separate the file name from the extension.
for f in files:
    print(f)
    df = pd.read_excel(f, header = None)
    print(df)
    # Use basename to get the filename with extension
    # Use split to separate the filename from the extension
    new_sheet_name = os.path.basename(f).split('.')[0]
    df.to_excel(w, sheet_name = new_sheet_name, index = False)
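One small refinement: split('.')[0] breaks if a file name contains an extra dot, and Excel caps sheet names at 31 characters, so longer names can cause trouble depending on the engine. A minimal sketch of a safer variant, reusing f, w and df from the loop above:

# Sketch: splitext handles dots inside file names; [:31] respects Excel's sheet-name limit
new_sheet_name = os.path.splitext(os.path.basename(f))[0][:31]
df.to_excel(w, sheet_name=new_sheet_name, index=False)

Also note that in recent pandas versions w.save() has been replaced by w.close(), or you can open the writer in a with-block.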

I decided to put my solution here as well, just in case it would be useful to anyone.
Thing is, I wanted to be able to recall where each final sheet came from. However, source workbooks can (and likely will) often have the same sheet names, like "Sheet 1", so I couldn't just use the sheet names from the original workbooks. I also could not use the source filenames as sheet names, since they might be longer than 31 characters, which is the maximum sheet name length allowed by Excel.
Therefore, I ended up assigning incremental numbers to resulting sheet names, while simultaneously inserting a new column named "source" at the start of each sheet and populating it with file name concatenated with sheet name. Hope it might help someone :)
from glob import glob
import pandas as pd
import os
files_input = glob(r'C:\Path\to\folder\*.xlsx')
result_DFs = []
for xlsx_file in files_input:
    file_DFs = pd.read_excel(xlsx_file, sheet_name=None)
    # save every sheet from every file as a dataframe to a list
    for sheet_DF in file_DFs:
        source_name = os.path.basename(xlsx_file) + ":" + sheet_DF
        file_DFs[sheet_DF].insert(0, 'source', source_name)
        result_DFs.append(file_DFs[sheet_DF])

with pd.ExcelWriter(r'C:\Path\to\resulting\file.xlsx') as writer:
    for df_index in range(len(result_DFs)):
        # write dataframe to file using a simple incremental number as the new sheet name
        result_DFs[df_index].to_excel(writer, sheet_name=str(df_index), index=False)
        # auto-adjust column width (can be omitted if not needed)
        for i, col in enumerate(result_DFs[df_index].columns):
            column_len = max(result_DFs[df_index][col].astype(str).str.len().max(), len(col) + 3)
            _ = writer.sheets[str(df_index)].set_column(i, i, column_len)
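One note: set_column is an xlsxwriter worksheet method, so the auto-width step above assumes the xlsxwriter engine. If openpyxl happens to be your default writer, you may need to request the engine explicitly; a minimal sketch with the same output path as above:

# Sketch: request the xlsxwriter engine so writer.sheets[...].set_column(...) is available
with pd.ExcelWriter(r'C:\Path\to\resulting\file.xlsx', engine='xlsxwriter') as writer:
    ...  # same writing loop as above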

Related

I am trying to iterate over xlsx files in a specified path. How do I exclude a specified file?

I have 3 xlsx files in a particular directory. I am combining them into a workbook. While combining, I need to exclude a specified file.
Path="C:/JackDaniels/100Pipers/"
name="Panic"
writer=ExcelWriter(Path+name+"*.xlsx")#creating a workbook in name "name")
inp=glob.glob(Path+"*.xlsx")
inp=inp.remove(Path+name+"*.xlsx")#to remove ths file to avoid overwrite
# I have a code that will combine sheets
When I tried to run the above code, I got the error below:
list.remove(x):x not in list
The question is really not clear and you should rephrase it.
If you are trying to combine them in the sense that you want to append all three sheets to a new empty sheet (for example, all of your sheets have the same columns), you should make a Python file in the same directory as your Excel worksheets:
import copy
import os
import pandas as pd
cwd = os.getcwd()
# list to store files
xlsx_files = []
exc_file = 'Exclude.xlsx' # <-- The file name you want to exclude goes here.
out_file = 'Output.xlsx' # <-- The output file name goes here.
# Iterate directory
for file in os.listdir(cwd):
    # check only Excel files
    if file.endswith('.xlsx'):
        xlsx_files.append(file)
print("All xlsx files:", xlsx_files)
df = pd.DataFrame()
aux_files_var = copy.deepcopy(xlsx_files)
for file in aux_files_var:
    print(file)
    if file == exc_file or file == out_file: continue  # <-- here you exclude the file and the output
    df = df.append(pd.read_excel(file), ignore_index=True)
    xlsx_files.remove(file)
print(f"""As you can see, only exc_file remains in xlsx_files.
Remaining xlsx files:{xlsx_files}""")
print(df)
df.to_excel(out_file)
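On the original error itself: glob.glob returns concrete file paths, so the literal pattern string Path+name+"*.xlsx" is never an element of inp, which is why list.remove raises "x not in list". Also, list.remove returns None, so assigning its result back to inp would discard the list even on success. A minimal sketch of excluding by name from the glob result, reusing the names from the question:

import glob
import os

Path = "C:/JackDaniels/100Pipers/"
name = "Panic"

inp = glob.glob(Path + "*.xlsx")
# keep everything except files whose base name starts with "Panic";
# avoids remove(), which needs an exact match and returns None
inp = [f for f in inp if not os.path.basename(f).startswith(name)]

Note also that DataFrame.append, used in the code above, was removed in pandas 2.0; pd.concat is the current replacement.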

How to get the string/key from a dictionary of data frames in pandas?

Sure it's simple but can't figure it out.
I've got code that creates a DataFrame from each tab of an xlsx file, then adds the df to a dictionary of data frames with the sheet name as the key (e.g. df_dict['sheet 1'], etc.). The code uses a loop where the string variable sheet_name is assigned to the actual sheet name.
I'm now at a point where I want to tell my code to tidy the df up in specific ways according to which df I'm currently referring to, but I can't find how to refer to the DataFrame's key. My code uses the string variable sheet_name in place of the actual key as it loops through sheets.
E.g. I want to say 'if df_dict[sheet_name] == df_dict['sheet 1'] then do x'.
I think I'm basically getting lost on how to refer to the 'string' or 'key' part of my dataframe dictionary in that context.
df_dict = {}
counter = 0
list_to_do = ['sheet1','sheet2']
# Defines directory as a path - not just a string
directory = os.fsencode(file_path)
# For each file in the folder
for file in os.listdir(directory):
    # Define the filename
    filename = os.fsdecode(file)
    # If it's a spreadsheet - work on it
    if filename.endswith(".xlsx") or filename.endswith(".xlsm"):
        print("Working on " + filename)
        # Add one to the counter to show how many files are worked through and which one you are on
        counter = counter + 1
        print('Counter = ' + str(counter))
        # Load the workbook from the given path and filename
        wb = load_workbook(filename = file_path + "/" + filename)
        # Unprotect each sheet and save with new name
        for sheet in wb:
            sheet.protection.disable()
            # Get the name of the current worksheet
            sheet_name = sheet.title
            print('Working on ' + sheet_name)
            if sheet_name in list_to_do:
                # Loads the excel file from the path, specifying the sheet
                df_dict[sheet_name] = pd.read_excel(file_path + "/" + filename, sheet_name)
                # Get rid of unwanted rows at the top according to the specific sheet
                # for all the keys in the dictionary of dataframes with rows to remove at the top
                """if df_dict[sheet_name] == 'sheet1': Do clean up stuff specific to sheet1"""
Based on my understanding of your question, this is how I would tackle the problem.
You can add error handling etc. should you need it:
sheet_1 = pd.DataFrame({"a":[1,2,3],"b":[3,4,5]})
sheet_2 = pd.DataFrame({"c":[1,2,3],"d":[3,4,5]})
dict_sheet = {"s1":sheet_1,"s2":sheet_2}
def callback_1(df):
    print(len(df))

def callback_2(df):
    print(df.shape)

dict_callback = {"s1": callback_1, "s2": callback_2}

for key, callback in dict_callback.items():
    if key in dict_sheet.keys():
        df = dict_sheet[key]
        output = callback(df)
You can use sheet_name=None or sheet_name=list_to_do to load your sheets into a dictionary where keys are the sheet names and values are the dataframes:
dfs = pd.read_excel('myfile.xlsx', sheet_name=list_to_do)
if 'sheet1' in dfs:
    # do stuff here
    dfs['sheet1'] = cleaned_df
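From there the sheet name is just the dictionary key, so per-sheet clean-up can branch on it; a minimal sketch (sheet names assumed from the question):

# Sketch: the keys of the dict returned by read_excel are the sheet names
for sheet_name, df in dfs.items():
    if sheet_name == 'sheet1':
        df = df.iloc[2:]  # e.g. drop unwanted rows at the top (illustrative only)
    dfs[sheet_name] = df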

How to import data from .txt file to a specifc excel sheet with Python?

I am trying to automate a process that basically reads in values from text files into certain excel cells. I have a template in excel that will read data from various sheets under certain names. For example, the template will read in data from "Video scores". Video scores is a .txt file that I copy and paste into excel. There are 5 different text files used in each project so it gets tedious after a while and when there are a lot of projects to complete.
How can I import or copy and paste these .txt files into Excel to a specified sheet? I have been using openpyxl for the other parts of this project, but I am open to using another library if it can't be done with openpyxl.
I've also tried opening and reading a file, but I couldn't figure out how to do what I want with that either. I have found a list of all the files I need; it's just a matter of getting them into Excel.
Thanks in advance for anyone who helps.
First, import the TXT file into a list in Python. I'm assuming the TXT file looks like this:
1
2
3
4
....
with open(path_txt, "r") as e:
    list1 = [i.strip() for i in e]  # strip() removes the trailing newline from each line
Then, we paste the values of the list onto the worksheet you need:
from openpyxl import load_workbook
wb = load_workbook(path_xlsx)
ws = wb[sheet_name]
ws["A1"] = "values" #just a header
row = 2 #represent the 2 row of the sheet
column = 1 #represent the column "A" of the sheet
for i in list1:
    ws.cell(row=row, column=column).value = i  # getting the current cell, and writing the value from the list
    row += 1  # just setting the current row to the next
wb.save(path_xlsx)
Hope this works for you.
Pandas would do the trick!
Approach:
Have a sheet containing the path to each of your files, the separator, and the corresponding target sheet name.
Now read this Excel sheet using pandas and iterate over each row of file details: read the data, then write it to a new sheet of the same target workbook.
import pandas as pd
file_details_path = r"/Users/path for xl sheet/file details/File2XlDetails.xlsx"
target_sheet_path = r"/Users/path to target xl sheet/File samples/FiletoXl.xlsx"
# create a writer to save the file content in excel
writer = pd.ExcelWriter(target_sheet_path, engine='xlsxwriter')
file_details = pd.read_excel(file_details_path,
                             dtype = str,
                             index_col = False
                             )

def write_to_excel(file, trg_sheet_name):
    # writes it to excel
    file.to_excel(writer,
                  sheet_name = trg_sheet_name,
                  index = False,
                  )

# loop through each file record
for index, file_dtl in file_details.iterrows():
    # you can print and check the row content for reference
    print(file_dtl['File_path'])
    print(file_dtl['Separator'])
    print(file_dtl['Target_sheet_name'])
    # reads file
    file = pd.read_csv(file_dtl['File_path'],
                       sep = file_dtl['Separator'],
                       dtype = str,
                       index_col = False,
                       )
    write_to_excel(file, file_dtl['Target_sheet_name'])

writer.save()
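For reference, the control workbook is assumed to hold one row per text file, with columns named File_path, Separator and Target_sheet_name (the names used in the loop above). A hypothetical sketch of building it:

# Sketch: a control workbook with the columns the loop above expects (values are hypothetical)
import pandas as pd

details = pd.DataFrame({
    'File_path': ['/Users/me/data/Video scores.txt'],
    'Separator': [','],
    'Target_sheet_name': ['Video scores'],
})
details.to_excel('File2XlDetails.xlsx', index=False)

Also note that recent pandas versions replace writer.save() with writer.close(), or you can open the ExcelWriter in a with-block.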
Hope this helps! Let me know if you run into any issues...

Import using Python - multiple excel files into a dataframe

I want to loop through a directory and find specific xlsx files and then put them each into separate pandas dataframe. The thing here is that I also want all sheets in those excel files to be in the dataframe.
Below is a sample of code that I implemented, I just need to add the logic to pick all sheets:
import pandas as pd
from glob import glob
path = 'path_to_file'
files = glob(path + '/*file*.xlsx')
get_df = lambda f: pd.read_excel(f)
dodf = {f: get_df(f) for f in files}
dodf[files[2]]  # dodf is a dictionary of dataframes
As described in this answer, in pandas you still have access to the ExcelFile class, which loads the file and creates an object.
This object has a .sheet_names property which gives you a list of sheet names in the current file.
xl = pd.ExcelFile('foo.xls')
xl.sheet_names # list of all sheet names
To actually handle the import of the specific sheet, use .parse(sheet_name) on the object of the imported Excel file:
xl.parse(sheet_name) # read a specific sheet to DataFrame
For your code something like:
get_df = lambda f: pd.ExcelFile(f)
dodf = {f: get_df(f) for f in files}
...gives you dodf, a dictionary of ExcelFile objects.
filename = 'yourfilehere.xlsx'
a_valid_sheet = dodf[filename].sheet_names[0] # First sheet
df = dodf[filename].parse(a_valid_sheet)
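If the goal is every sheet of every matching file, pandas can also do it in one step: sheet_name=None makes read_excel return a dictionary of DataFrames keyed by sheet name. A minimal sketch reusing files from the question:

# Sketch: {filename: {sheet_name: DataFrame}} for every sheet in every file
dodf = {f: pd.read_excel(f, sheet_name=None) for f in files}
df = dodf[files[0]]['Sheet1']  # 'Sheet1' is a hypothetical sheet name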

How to concatenate three Excel xlsx files using Python?

Hello, I would like to concatenate three Excel xlsx files using Python.
I have tried using openpyxl, but I don't know which function could help me append three worksheets into one.
Do you have any ideas how to do that ?
Thanks a lot
Here's a pandas-based approach. (It's using openpyxl behind the scenes.)
import pandas as pd
# filenames
excel_names = ["xlsx1.xlsx", "xlsx2.xlsx", "xlsx3.xlsx"]
# read them in
excels = [pd.ExcelFile(name) for name in excel_names]
# turn them into dataframes
frames = [x.parse(x.sheet_names[0], header=None,index_col=None) for x in excels]
# delete the first row for all frames except the first
# i.e. remove the header row -- assumes it's the first
frames[1:] = [df[1:] for df in frames[1:]]
# concatenate them..
combined = pd.concat(frames)
# write it out
combined.to_excel("c.xlsx", header=False, index=False)
I'd use xlrd and xlwt. Assuming you literally just need to append these files (rather than doing any real work on them), I'd do something like: Open up a file to write to with xlwt, and then for each of your other three files, loop over the data and add each row to the output file. To get you started:
import xlwt
import xlrd
wkbk = xlwt.Workbook()
outsheet = wkbk.add_sheet('Sheet1')
xlsfiles = [r'C:\foo.xlsx', r'C:\bar.xlsx', r'C:\baz.xlsx']
outrow_idx = 0
for f in xlsfiles:
    # This is all untested; essentially just pseudocode for concept!
    insheet = xlrd.open_workbook(f).sheets()[0]
    for row_idx in range(insheet.nrows):
        for col_idx in range(insheet.ncols):
            outsheet.write(outrow_idx, col_idx,
                           insheet.cell_value(row_idx, col_idx))
        outrow_idx += 1
wkbk.save(r'C:\combined.xls')
If your files all have a header line, you probably don't want to repeat that, so you could modify the code above to look more like this:
firstfile = True  # Is this the first sheet?
for f in xlsfiles:
    insheet = xlrd.open_workbook(f).sheets()[0]
    for row_idx in range(0 if firstfile else 1, insheet.nrows):
        pass  # processing; etc
    firstfile = False  # We're done with the first sheet.
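A caveat for readers today: xlrd 2.0 dropped support for .xlsx files, and xlwt only writes the legacy .xls format, so for .xlsx inputs a current reader such as openpyxl (or the pandas approaches elsewhere on this page) is the safer choice. A rough sketch of the read side with openpyxl, under that assumption:

# Sketch: read the first sheet of an .xlsx file with openpyxl instead of xlrd
from openpyxl import load_workbook

insheet = load_workbook(r'C:\foo.xlsx', read_only=True).worksheets[0]
for row in insheet.iter_rows(values_only=True):
    print(row)  # each row is a tuple of cell values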
When I combine excel files (mydata1.xlsx, mydata2.xlsx, mydata3.xlsx) for data analysis, here is what I do:
import pandas as pd
import numpy as np
import glob
all_data = pd.DataFrame()
for f in glob.glob('myfolder/mydata*.xlsx'):
    df = pd.read_excel(f)
    all_data = all_data.append(df, ignore_index=True)
Then, when I want to save it as one file:
writer = pd.ExcelWriter('mycollected_data.xlsx', engine='xlsxwriter')
all_data.to_excel(writer, sheet_name='Sheet1')
writer.save()
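A note for current pandas versions: DataFrame.append was removed in pandas 2.0 and ExcelWriter.save has given way to close (or the with-block form), so an equivalent sketch today might look like this, with the same file pattern as above:

import glob
import pandas as pd

# Sketch: collect the frames and concatenate once instead of appending in the loop
frames = [pd.read_excel(f) for f in glob.glob('myfolder/mydata*.xlsx')]
all_data = pd.concat(frames, ignore_index=True)

with pd.ExcelWriter('mycollected_data.xlsx', engine='xlsxwriter') as writer:
    all_data.to_excel(writer, sheet_name='Sheet1')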
Solution with openpyxl only (without a bunch of other dependencies).
This script should take care of merging together an arbitrary number of xlsx documents, whether they have one or multiple sheets. It will preserve the formatting.
There's a function to copy sheets in openpyxl, but it is only from/to the same file. There's also a function insert_rows somewhere, but by itself it won't insert any rows. So I'm afraid we are left to deal (tediously) with one cell at a time.
As much as I dislike using for loops and would rather use something compact and elegant like list comprehension, I don't see how to do that here as this is a side-effect show.
Credit to this answer on copying between workbooks.
#!/usr/bin/env python3
#USAGE
#mergeXLSX.py <a bunch of .xlsx files> ... output.xlsx
#
#where output.xlsx is the unified file
#This works FROM/TO the xlsx format. Libreoffice might help to convert from xls.
#localc --headless --convert-to xlsx somefile.xls
import sys
from copy import copy
from openpyxl import load_workbook,Workbook
def createNewWorkbook(manyWb):
    for wb in manyWb:
        for sheetName in wb.sheetnames:
            o = theOne.create_sheet(sheetName)
            safeTitle = o.title
            copySheet(wb[sheetName], theOne[safeTitle])

def copySheet(sourceSheet, newSheet):
    for row in sourceSheet.rows:
        for cell in row:
            newCell = newSheet.cell(row=cell.row, column=cell.col_idx,
                                    value=cell.value)
            if cell.has_style:
                newCell.font = copy(cell.font)
                newCell.border = copy(cell.border)
                newCell.fill = copy(cell.fill)
                newCell.number_format = copy(cell.number_format)
                newCell.protection = copy(cell.protection)
                newCell.alignment = copy(cell.alignment)

filesInput = sys.argv[1:]
theOneFile = filesInput.pop(-1)
myfriends = [load_workbook(f) for f in filesInput]

#try this if you are bored
#myfriends = [ openpyxl.load_workbook(f) for k in range(200) for f in filesInput ]

theOne = Workbook()
del theOne['Sheet']  # We want our new book to be empty. Thanks.
createNewWorkbook(myfriends)
theOne.save(theOneFile)
Tested with openpyxl 2.5.4, python 3.4.
You can simply use the pandas and os libraries to do this.
import pandas as pd
import os
#create an empty dataframe which will have all the combined data
mergedData = pd.DataFrame()
for files in os.listdir():
    # make sure you are only reading excel files
    if files.endswith('.xlsx'):
        data = pd.read_excel(files, index_col=None)
        mergedData = mergedData.append(data)
        # move the files to another folder so that they do not get processed multiple times
        os.rename(files, 'path to some other folder')
The mergedData DataFrame will have all the combined data, which you can export to a separate Excel or CSV file. The same code will work with CSV files as well; just change the extension in the if condition.
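A short sketch of that export step (and note that in pandas 2.0+ the mergedData.append call above would need to become pd.concat):

# Sketch: export the combined data (mergedData from the loop above)
mergedData.to_excel('combined.xlsx', index=False)
# CSV works the same way: check files.endswith('.csv'), read with pd.read_csv,
# and export with mergedData.to_csv('combined.csv', index=False)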
Just to add to p_barill's answer, if you have custom column widths that you need to copy, you can add the following to the bottom of copySheet:
for col in sourceSheet.column_dimensions:
    newSheet.column_dimensions[col] = sourceSheet.column_dimensions[col]
I would just post this in a comment on his or her answer but my reputation isn't high enough.
