Pandas: export all csv files from multiple xlsx files - python

I have many xlsx files in a directory, each with multiple sheets inside. Without opening a file I don't know what the sheets are called.
I want to export all sheets to csv files, one sheet = one csv file.
I know that my code is very ugly and not well optimized.
The header.txt file will help me in the next step to create a dict for renaming columns according to my pattern, so please ignore this part :)
import glob
import pandas as pd

xls_files = glob.glob('**/*.xlsx')
header = []
for xls_file in xls_files:
    print(f'{xls_file =}')
    file_name = xls_file.replace("xlsx\\", "").replace(' ', "_").split('.')[0]
    print(f'{file_name =}')
    df = pd.read_excel(xls_file, sheet_name=None)
    # for sheet in df.keys():
    #     print(f'{sheet =}')
    #     sheet_name = sheet.replace(".", "_").replace(' ', "_")
    #     csv_file_name = (f'{file_name}_{sheet_name}.csv')
    #     sheet.to_csv(csv_file_name, index=False)
    #     print(sheet.head())
    for sheet in df.keys():
        print(f'{sheet =}')
        df_temp = pd.read_excel(xls_file, sheet_name=sheet)
        sheet_name = sheet.replace(".", "_").replace(' ', "_")
        csv_file_name = (f'{file_name}_{sheet_name}.csv')
        print(f'{csv_file_name =}')
        for column in df_temp.columns:
            if column not in header:
                header.append(column)
                print(column)
        df_temp.to_csv(csv_file_name, index=False)

with open('header.txt', 'a+') as f:
    for ele in header:
        f.write(ele)
How can I improve my code so it performs better and does not reread the same Excel file just to get one sheet?
This works for me, but it is very slow.
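Since pd.read_excel(..., sheet_name=None) already returns a dict mapping sheet names to DataFrames, the per-sheet read_excel call can be dropped and the dict values written out directly. A minimal sketch of that idea, keeping the naming logic from the question:

import glob
import pandas as pd

header = []
for xls_file in glob.glob('**/*.xlsx'):
    file_name = xls_file.replace("xlsx\\", "").replace(' ', "_").split('.')[0]
    # sheet_name=None reads every sheet in one pass and returns {sheet name: DataFrame}
    sheets = pd.read_excel(xls_file, sheet_name=None)
    for sheet, df_temp in sheets.items():
        sheet_name = sheet.replace(".", "_").replace(' ', "_")
        df_temp.to_csv(f'{file_name}_{sheet_name}.csv', index=False)
        # collect unique column names for the later renaming step
        for column in df_temp.columns:
            if column not in header:
                header.append(column)

with open('header.txt', 'a+') as f:
    for ele in header:
        f.write(f'{ele}\n')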

Related

Multiple Excel Files as Separate Sheets Using Python

Most of the articles I'm seeing either:
a) combine multiple single-sheet Excel workbooks into one master workbook with just a single sheet, or
b) split a multi-sheet Excel workbook into individual workbooks.
However, my goal is to grab all the Excel files in a specific folder and save them as individual sheets within one new master Excel workbook. I'm trying to name each sheet after the original file.
import pandas as pd
import glob
import os

file = "C:\\File\\Path\\"
filename = 'Consolidated Files.xlsx'
pth = os.path.dirname(file)
extension = os.path.splitext(file)[1]
files = glob.glob(os.path.join(pth, '*xlsx'))
w = pd.ExcelWriter(file + filename)

for f in files:
    print(f)
    df = pd.read_excel(f, header=None)
    print(df)
    df.to_excel(w, sheet_name=f, index=False)
w.save()
How do I adjust the names for each sheet? Also, if you see any opportunities to clean this up please let me know
You cannot name a sheet with special characters, and f here is the full path plus the file name. Use only the file name for the sheet name: os.path.basename gets the file name, and split separates the file name from the extension.
for f in files:
    print(f)
    df = pd.read_excel(f, header=None)
    print(df)
    # Use basename to get the filename with extension
    # Use split to separate the filename from the extension
    new_sheet_name = os.path.basename(f).split('.')[0]
    df.to_excel(w, sheet_name=new_sheet_name, index=False)
I decided to put my solution here as well, just in case it is useful to anyone.
Thing is, I wanted to be able to recall where each final sheet came from. However, source workbooks can (and likely will) have the same sheet names, like "Sheet 1", so I couldn't just reuse the sheet names from the original workbooks. I also could not use the source file names as sheet names, since they might be longer than 31 characters, which is the maximum sheet name length allowed by Excel.
Therefore, I ended up assigning incremental numbers to the resulting sheet names, while inserting a new column named "source" at the start of each sheet and populating it with the file name concatenated with the sheet name. Hope it helps someone :)
from glob import glob
import pandas as pd
import os

files_input = glob(r'C:\Path\to\folder\*.xlsx')
result_DFs = []

for xlsx_file in files_input:
    file_DFs = pd.read_excel(xlsx_file, sheet_name=None)
    # save every sheet from every file as a dataframe in a list
    for sheet_DF in file_DFs:
        source_name = os.path.basename(xlsx_file) + ":" + sheet_DF
        file_DFs[sheet_DF].insert(0, 'source', source_name)
        result_DFs.append(file_DFs[sheet_DF])

# set_column below is an xlsxwriter method, so that engine is needed here
with pd.ExcelWriter(r'C:\Path\to\resulting\file.xlsx', engine='xlsxwriter') as writer:
    for df_index in range(len(result_DFs)):
        # write each dataframe using a simple incremental number as the new sheet name
        result_DFs[df_index].to_excel(writer, sheet_name=str(df_index), index=False)
        # auto-adjust column width (can be omitted if not needed)
        for i, col in enumerate(result_DFs[df_index].columns):
            column_len = max(result_DFs[df_index][col].astype(str).str.len().max(), len(col) + 3)
            _ = writer.sheets[str(df_index)].set_column(i, i, column_len)

Python: How to copy Excel worksheet from multiple Excel files to one Excel file that contains all the worksheets from other Excel files

It's my first time using pandas. I have multiple Excel files that I want to combine into one Excel file using Python pandas.
I managed to merge the content of the first sheet of each Excel file into one sheet in a new Excel file, as shown in the figure below:
combined sheets in one sheet
I wrote this code to implement this:
import glob
import pandas as pd

path = "C:/folder"
file_identifier = "*.xls"

all_data = pd.DataFrame()
for f in glob.glob(path + "/*" + file_identifier):
    df = pd.read_excel(f)
    all_data = all_data.append(df, ignore_index=True)

writer = pd.ExcelWriter('combined.xls', engine='xlsxwriter')
all_data.to_excel(writer, sheet_name='Summary Sheet')
writer.save()

file_df = pd.read_excel("C:/folder/combined.xls")
# Keep only the FIRST record from each set of duplicates
file_df_first_record = file_df.drop_duplicates(
    subset=["Test summary", "Unnamed: 1", "Unnamed: 2", "Unnamed: 3"], keep="first")
file_df_first_record.to_excel("filtered.xls", index=False, sheet_name='Summary Sheet')
But I have two issues:
1. How to remove the cells that have "Unnamed", as shown in the previous figure.
2. How to copy the other worksheets (the second worksheet in each Excel file, not the first) from all the Excel files and put them into one Excel file with multiple worksheets, with the different students' names, as shown in the picture:
all worksheets in one excel file
So I managed to combine worksheet 1 from all the Excel files into one sheet, but now I want to copy the A, B, C, D, E worksheets into one Excel file that holds all the remaining worksheets from the other Excel files.
Each of the Excel files I have looks like this:
single excel file
If you want all the data gathered together in one worksheet, you can use the following script:
1. Put all the Excel workbooks (i.e. Excel files) to be processed into a folder (see the variable paths).
2. Get the paths of all workbooks in that folder using glob.glob.
3. Read all worksheets of each workbook with read_excel(path, sheet_name=None) and prepare them for merging.
4. Merge all worksheets with pd.concat.
5. Export the final output with to_excel.
import pandas as pd
import glob

paths = glob.glob(r"C:\excelfiles\*.xlsx")
path_save = r"finished.xlsx"

# read every sheet of every workbook (sheet_name=None returns a dict of DataFrames)
df_lst = [pd.read_excel(path, sheet_name=None).values() for path in paths]
# push each header row down into the data so all frames can be stacked header-less
df_lst = [y.transpose().reset_index().transpose() for x in df_lst for y in x]
df_result = pd.concat(df_lst, ignore_index=True)
df_result.to_excel(path_save, index=False, header=False)
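As a side note on the asker's two issues: the "Unnamed: N" labels come from blank header cells and those columns can be filtered out after reading, while the second worksheet of each workbook can be copied into its own sheet of a new file using sheet_name=1. A rough sketch under those assumptions (the output file name is illustrative):

import glob
import os
import pandas as pd

# Issue 1: drop columns pandas auto-named "Unnamed: N" (they come from blank header cells)
file_df = pd.read_excel("C:/folder/combined.xls")
file_df = file_df.loc[:, ~file_df.columns.astype(str).str.startswith("Unnamed")]

# Issue 2: copy the second worksheet of every workbook into one output file,
# one worksheet per source file ("second_sheets.xlsx" is a hypothetical name)
with pd.ExcelWriter("C:/folder/second_sheets.xlsx") as writer:
    for f in glob.glob("C:/folder/*.xls"):
        second = pd.read_excel(f, sheet_name=1)          # sheet index 1 = second sheet
        sheet = os.path.basename(f).split(".")[0][:31]   # respect Excel's 31-char limit
        second.to_excel(writer, sheet_name=sheet, index=False)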

How to import data from .txt file to a specifc excel sheet with Python?

I am trying to automate a process that basically reads values from text files into certain Excel cells. I have an Excel template that reads data from various sheets with certain names. For example, the template reads data from "Video scores". Video scores is a .txt file that I copy and paste into Excel. There are 5 different text files used in each project, so it gets tedious after a while when there are a lot of projects to complete.
How can I import or copy and paste these .txt files into Excel, into a specified sheet? I have been using openpyxl for the other parts of this project, but I am open to using another library if it can't be done with openpyxl.
I've also tried opening and reading a file, but I couldn't figure out how to do what I want with that either. I have a list of all the files I need; it's just a matter of getting them into Excel.
Thanks in advance to anyone who helps.
First, import the TXT file into a list in Python. I'm assuming the TXT file looks like this:
1
2
3
4
....
with open(path_txt, "r") as e:
    list1 = [i.strip() for i in e]  # strip the trailing newlines
Then we paste the values of the list onto the worksheet you need:
from openpyxl import load_workbook

wb = load_workbook(path_xlsx)
ws = wb[sheet_name]
ws["A1"] = "values"  # just a header
row = 2     # start at row 2 of the sheet
column = 1  # column "A" of the sheet
for i in list1:
    ws.cell(row=row, column=column).value = i  # write the list value into the current cell
    row += 1  # move on to the next row
wb.save(path_xlsx)
Hope this works for you.
Pandas would do the trick!
Approach:
1. Have a sheet containing the path to each of your files, its separator, and the corresponding target sheet name.
2. Read this Excel sheet using pandas and iterate over each row: for each file's details, read the data and write it to a new Excel sheet of the same workbook.
import pandas as pd

file_details_path = r"/Users/path for xl sheet/file details/File2XlDetails.xlsx"
target_sheet_path = r"/Users/path to target xl sheet/File samples/FiletoXl.xlsx"

# create a writer to save the file content in excel
writer = pd.ExcelWriter(target_sheet_path, engine='xlsxwriter')

file_details = pd.read_excel(file_details_path,
                             dtype=str,
                             index_col=False)

def write_to_excel(file, trg_sheet_name):
    # writes it to excel
    file.to_excel(writer,
                  sheet_name=trg_sheet_name,
                  index=False)

# loop through each file record
for index, file_dtl in file_details.iterrows():
    # you can print and check the row content for reference
    print(file_dtl['File_path'])
    print(file_dtl['Separator'])
    print(file_dtl['Target_sheet_name'])
    # read the file
    file = pd.read_csv(file_dtl['File_path'],
                       sep=file_dtl['Separator'],
                       dtype=str,
                       index_col=False)
    write_to_excel(file, file_dtl['Target_sheet_name'])

writer.save()
Hope this helps! Let me know if you run into any issues...
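For reference, the layout of the file-details workbook this script expects is an assumption based on the column names used above (File_path, Separator, Target_sheet_name); it could be built once with something like this (the paths and second file name are placeholders):

import pandas as pd

# hypothetical contents of File2XlDetails.xlsx: one row per text file to import
details = pd.DataFrame({
    "File_path": ["/Users/example/Video scores.txt", "/Users/example/Audio scores.txt"],
    "Separator": ["\t", ","],
    "Target_sheet_name": ["Video scores", "Audio scores"],
})
details.to_excel("File2XlDetails.xlsx", index=False)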

Python pandas merge and save with existed sheets

I want to merge multiple Excel files (1.xlsm, 2.xlsm, ...) into an A.xlsm file that contains a macro and 3 sheets.
So I tried to merge them:
# input_file = (./*.xlsx)
all_data = pd.DataFrame()
for f in input_file:
    df = pd.read_excel(f)
    all_data = all_data.append(df, ignore_index=True, sort=False)

writer = pd.ExcelWriter('A.xlsm', engine='openpyxl')
all_data.to_excel(writer, 'Sheet1')
writer.save()
The code does not raise an error, but the resulting file A.xlsm gives an error when opened, so I changed the extension to A.xlsx and opened it.
It opens OK, but all the sheets and the macro are gone.
How can I merge multiple xlsx files into an xlsm file and keep the macro?
I believe that if you want to use macro-enabled workbooks you need to load them with keep_vba=True:
from openpyxl import load_workbook
XlMacroFile = load_workbook('A.xlsm',keep_vba=True)
To preserve separate sheets, you can do something like
df_list = #list of your dataframes
filename = #name of your output file
with pd.ExcelWriter(filename) as writer:
for df in df_list:
df.to_excel(writer, sheet_name='sheet_name_goes_here')
This will write each dataframe in a separate sheet in your output excel file.
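Putting the two ideas together, one possible sketch for the original xlsm question: read and concatenate the source files, open A.xlsm with keep_vba=True so the macro survives, append the merged data as a new sheet with plain openpyxl, and save back under the .xlsm name. The glob pattern and the sheet name "Merged" are assumptions:

import glob
import os
import pandas as pd
from openpyxl import load_workbook
from openpyxl.utils.dataframe import dataframe_to_rows

# combine the source workbooks first (sources assumed to be the numbered .xlsm files)
source_files = [f for f in glob.glob('./*.xlsm') if os.path.basename(f) != 'A.xlsm']
all_data = pd.concat(
    [pd.read_excel(f) for f in source_files],
    ignore_index=True, sort=False)

# open the macro workbook with its VBA project intact
wb = load_workbook('A.xlsm', keep_vba=True)
ws = wb.create_sheet('Merged')  # new sheet next to the existing three

# write the merged DataFrame, header row included
for row in dataframe_to_rows(all_data, index=False, header=True):
    ws.append(row)

wb.save('A.xlsm')  # saving under .xlsm keeps the existing sheets and the macro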

How to combine multiple excel files having multiple equal number of sheets in each excel files

I am currently able to combine multiple Excel files that each have one sheet.
I want to combine multiple Excel files that each have two sheets, giving a name to each sheet. How can I achieve this?
Below is my current code for combining a single sheet from multiple Excel files, without giving sheet names in the combined file:
import pandas as pd
# filenames
excel_names = ["xlsx1.xlsx", "xlsx2.xlsx", "xlsx3.xlsx"]
# read them in
excels = [pd.ExcelFile(name) for name in excel_names]
# turn them into dataframes
frames = [x.parse(x.sheet_names[0], header=None,index_col=None) for x in excels]
# delete the first row for all frames except the first
# i.e. remove the header row -- assumes it's the first
frames[1:] = [df[1:] for df in frames[1:]]
# concatenate them..
combined = pd.concat(frames)
# write it out
combined.to_excel("c.xlsx", header=False, index=False)
First, combine the first and the second sheets separately:
import pandas as pd

# filenames
excel_names = ["xlsx1.xlsx", "xlsx2.xlsx", "xlsx3.xlsx"]

def combine_excel_to_dfs(excel_names, sheet_name):
    sheet_frames = [pd.read_excel(x, sheet_name=sheet_name) for x in excel_names]
    combined_df = pd.concat(sheet_frames).reset_index(drop=True)
    return combined_df

df_first = combine_excel_to_dfs(excel_names, 0)
df_second = combine_excel_to_dfs(excel_names, 1)
Then use pd.ExcelWriter and write these sheets to the same Excel file:
# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter('two_sheets_combined.xlsx', engine='xlsxwriter')
# Write each dataframe to a different worksheet.
df_first.to_excel(writer, sheet_name='Sheet1')
df_second.to_excel(writer, sheet_name='Sheet2')
# Close the Pandas Excel writer and output the Excel file.
writer.save()
You can use the following (it reuses the excels list of pd.ExcelFile objects from the question):
# number of sheets
N = 2

# get all sheets into nested lists
frames = [[x.parse(y, index_col=None) for y in x.sheet_names] for x in excels]
#print (frames)

# combine the first dataframe from the first list with the first df from the second list, and so on
combined = [pd.concat([x[i] for x in frames], ignore_index=True) for i in range(N)]
#print (combined)

# write to file
writer = pd.ExcelWriter('c.xlsx', engine='xlsxwriter')
for i, x in enumerate(combined):
    x.to_excel(writer, sheet_name='Sheet{}'.format(i + 1))
writer.save()
