Looping through a folder to merge several excel sheets into one column - python

I have several workbooks, each with three sheets. I want to loop through each workbook and merge all the data from sheet_1 into a new workbook_1 file, sheet_2 into workbook_2 file & sheet_3 into workbook_3.
As far as I can tell the script below does everything I need, except rather than appending the data, it overwrites the data from the previous iteration.
For the sake of parsimony I've shortened, cleaned & simplified my script, but I'm happy to share the full script if needed.
import pandas as pd
import glob
search_dir= ('/Users/PATH/*.xlsx')
sheet_names = ['sheet_1','sheet_2','sheet_2']
def a_joiner(sheet):
for loop_x in glob.glob(search_dir):
try:
if sheet == 'sheet_1':
id_file= pd.ExcelFile(loop_x)
df_1 = id_file.parse(sheet, header= None)
writer= pd.ExcelWriter('/Users/PATH/%s.xlsx' %(sheet), engine= 'xlsxwriter')
df_1.to_excel(writer)
writer.save()
elif sheet == 'sheet_2':
#do same as above
else:
#and do same as above again
except Exception as e:
print('Error:',e)
for sheet in sheet_names:
a_joiner(sheet)

You can also easilly append data like:
df = []
for f in ['c:\\file1.xls', 'c:\\ file2.xls']:
data = pd.read_excel(f, 'Sheet1').iloc[:-2]
data.index = [os.path.basename(f)] * len(data)
df.append(data)
df = pd.concat(df)
From:
Using pandas Combining/merging 2 different Excel files/sheets

Related

Read CSV sheet data and created new one

I have a CSV file which have multiple sheets in it. Want to read it sheet by sheet and filter some data and want to create csv file in same format. how can I do that. Please suggest. I was trying it though pandas.ExcelReader but its not working for CSV file.
you can use the following code for this may help!
import pandas as pd
def read_excel_sheets(xls_path):
"""Read all sheets of an Excel workbook and return a single DataFrame"""
print(f'Loading {xls_path} into pandas')
xl = pd.ExcelFile(xls_path)
df = pd.DataFrame()
columns = None
for idx, name in enumerate(xl.sheet_names):
print(f'Reading sheet #{idx}: {name}')
sheet = xl.parse(name)
if idx == 0:
# Save column names from the first sheet to match for append
columns = sheet.columns
sheet.columns = columns
# Assume index of existing data frame when appended
df = df.append(sheet, ignore_index=True)
return df
the resource for this code is the link below:
click here
and for converting it back to csv you can follow the post which link is
attached here

Can I export a dataframe to excel as the very first sheet?

Running dataframe.to_excel() automatically saves the dataframe as the last sheet in the Excel file.
Is there a way to save a dataframe as the very first sheet, so that, when you open the spreadsheet, Excel shows it as the first on the left?
The only workaround I have found is to first export an empty dataframe to the tab with the name I want as first, then export the others, then export the real dataframe I want to the tab with the name I want. Example in the code below. Is there a more elegant way? More generically, is there a way to specifically choose the position of the sheet you are exporting to (first, third, etc)?
Of course this arises because the dataframe I want as first is the result of some calculations based on all the others, so I cannot export it.
import pandas as pd
import numpy as np
writer = pd.ExcelWriter('My excel test.xlsx')
first_df = pd.DataFrame()
first_df['x'] = np.arange(0,100)
first_df['y'] = 2 * first_df['x']
other_df = pd.DataFrame()
other_df['z'] = np.arange(100,201)
pd.DataFrame().to_excel(writer,'this should be the 1st')
other_df.to_excel(writer,'other df')
first_df.to_excel(writer,'this should be the 1st')
writer.save()
writer.close()
It is possible to re-arrange the sheets after they have been created:
import pandas as pd
import numpy as np
writer = pd.ExcelWriter('My excel test.xlsx')
first_df = pd.DataFrame()
first_df['x'] = np.arange(0,100)
first_df['y'] = 2 * first_df['x']
other_df = pd.DataFrame()
other_df['z'] = np.arange(100,201)
other_df.to_excel(writer,'Sheet2')
first_df.to_excel(writer,'Sheet1')
writer.save()
This will give you this output:
Add this before you save the workbook:
workbook = writer.book
workbook.worksheets_objs.sort(key=lambda x: x.name)

Using Pandas to Convert from Excel to CSV and I Have Multiple Possible Excel Sheet Names

I am trying to convert a large number of Excel documents to CSV using Python, and the sheet I am converting from each document can either be called "Pivot", "PVT", "pivot", or "pvt". The way I am doing some right now seems to be working, but I was wondering if there was any quicker way as this takes a long time to go through my Excel files. Is there a way I can accomplish the same thing all in one pd.read_excel line using an OR operator to specify multiple variations of the sheet name?
for f in glob.glob("../Test/Drawsheet*.xlsx"):
try:
data_xlsx = pd.read_excel(f, 'PVT', index_col=None)
except:
try:
data_xlsx = pd.read_excel(f, 'pvt', index_col=None)
except:
try:
data_xlsx = pd.read_excel(f, 'pivot', index_col=None)
except:
try:
data_xlsx = pd.read_excel(f, 'Pivot', index_col=None)
except:
continue
data_xlsx.to_csv('csvfile' + str(counter) + '.csv', encoding='utf-8')
counter += 1
Your problem isn't so much about find the correct special syntax for pd.read_excel but rather knowing which sheet to read from. Pandas has an ExcelFile that encapsulates and some basic info about an Excel file. The class has a sheet_names property that tell you what sheets are in the file. (Unfortunately documnetation on this class is a bit hard to find so I can't give you a link)
valid_sheet_names = ['PVT', 'pvt', 'pivot', 'Pivot']
for f in glob.iglob('../Test/Drawsheet*.xlsx'):
file = pd.ExcelFile(f)
sheet_name = None
for name in file.sheet_names:
if name in valid_sheet_names:
sheet_name = name
break
if sheet_name is None:
continue
data_xlsx = pd.read_excel(f, sheet_name, index_col=None)
...
However, this is not strictly equivalent to your code as it does not do 2 things:
Cascade read_excel if the chosen sheet fails to be loaded into a data frame
Have a priority ranking for the sheet names (like PVT first, then pvt, then pivot, etc.)
I'll leave you on how to handle these two problems as your program requires.

How to combine multiple excel files having multiple equal number of sheets in each excel files

I am able to combine multiple excel files having one sheet currently.
I want to combine multiple sheets having two different sheets in each excel file with giving name to each sheets How can I achieve this?
Here below is my current code for combining single sheet in multiple excel files without giving sheet name to Combined excel file
import pandas as pd
# filenames
excel_names = ["xlsx1.xlsx", "xlsx2.xlsx", "xlsx3.xlsx"]
# read them in
excels = [pd.ExcelFile(name) for name in excel_names]
# turn them into dataframes
frames = [x.parse(x.sheet_names[0], header=None,index_col=None) for x in excels]
# delete the first row for all frames except the first
# i.e. remove the header row -- assumes it's the first
frames[1:] = [df[1:] for df in frames[1:]]
# concatenate them..
combined = pd.concat(frames)
# write it out
combined.to_excel("c.xlsx", header=False, index=False)
First combine the first and the second sheet separately
import pandas as pd
# filenames
excel_names = ["xlsx1.xlsx", "xlsx2.xlsx", "xlsx3.xlsx"]
def combine_excel_to_dfs(excel_names, sheet_name):
sheet_frames = [pd.read_excel(x, sheet_name=sheet_name) for x in excel_names]
combined_df = pd.concat(sheet_frames).reset_index(drop=True)
return combined_df
df_first = combine_excel_to_dfs(excel_names, 0)
df_second = combine_excel_to_dfs(excel_names, 1)
Use pd.ExcelWriter
And write these sheets to the same excel file:
# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter('two_sheets_combined.xlsx', engine='xlsxwriter')
# Write each dataframe to a different worksheet.
df_first.to_excel(writer, sheet_name='Sheet1')
df_second.to_excel(writer, sheet_name='Sheet2')
# Close the Pandas Excel writer and output the Excel file.
writer.save()
You can use:
#number of sheets
N = 2
#get all sheets to nested lists
frames = [[x.parse(y, index_col=None) for y in x.sheet_names] for x in excels]
#print (frames)
#combine firt dataframe from first list with first df with second list...
combined = [pd.concat([x[i] for x in frames], ignore_index=True) for i in range(N)]
#print (combined)
#write to file
writer = pd.ExcelWriter('c.xlsx', engine='xlsxwriter')
for i, x in enumerate(combined):
x.to_excel(writer, sheet_name='Sheet{}'.format(i + 1))
writer.save()

How to concatenate three excels files xlsx using python?

Hello I would like to concatenate three excels files xlsx using python.
I have tried using openpyxl, but I don't know which function could help me to append three worksheet into one.
Do you have any ideas how to do that ?
Thanks a lot
Here's a pandas-based approach. (It's using openpyxl behind the scenes.)
import pandas as pd
# filenames
excel_names = ["xlsx1.xlsx", "xlsx2.xlsx", "xlsx3.xlsx"]
# read them in
excels = [pd.ExcelFile(name) for name in excel_names]
# turn them into dataframes
frames = [x.parse(x.sheet_names[0], header=None,index_col=None) for x in excels]
# delete the first row for all frames except the first
# i.e. remove the header row -- assumes it's the first
frames[1:] = [df[1:] for df in frames[1:]]
# concatenate them..
combined = pd.concat(frames)
# write it out
combined.to_excel("c.xlsx", header=False, index=False)
I'd use xlrd and xlwt. Assuming you literally just need to append these files (rather than doing any real work on them), I'd do something like: Open up a file to write to with xlwt, and then for each of your other three files, loop over the data and add each row to the output file. To get you started:
import xlwt
import xlrd
wkbk = xlwt.Workbook()
outsheet = wkbk.add_sheet('Sheet1')
xlsfiles = [r'C:\foo.xlsx', r'C:\bar.xlsx', r'C:\baz.xlsx']
outrow_idx = 0
for f in xlsfiles:
# This is all untested; essentially just pseudocode for concept!
insheet = xlrd.open_workbook(f).sheets()[0]
for row_idx in xrange(insheet.nrows):
for col_idx in xrange(insheet.ncols):
outsheet.write(outrow_idx, col_idx,
insheet.cell_value(row_idx, col_idx))
outrow_idx += 1
wkbk.save(r'C:\combined.xls')
If your files all have a header line, you probably don't want to repeat that, so you could modify the code above to look more like this:
firstfile = True # Is this the first sheet?
for f in xlsfiles:
insheet = xlrd.open_workbook(f).sheets()[0]
for row_idx in xrange(0 if firstfile else 1, insheet.nrows):
pass # processing; etc
firstfile = False # We're done with the first sheet.
When I combine excel files (mydata1.xlsx, mydata2.xlsx, mydata3.xlsx) for data analysis, here is what I do:
import pandas as pd
import numpy as np
import glob
all_data = pd.DataFrame()
for f in glob.glob('myfolder/mydata*.xlsx'):
df = pd.read_excel(f)
all_data = all_data.append(df, ignore_index=True)
Then, when I want to save it as one file:
writer = pd.ExcelWriter('mycollected_data.xlsx', engine='xlsxwriter')
all_data.to_excel(writer, sheet_name='Sheet1')
writer.save()
Solution with openpyxl only (without a bunch of other dependencies).
This script should take care of merging together an arbitrary number of xlsx documents, whether they have one or multiple sheets. It will preserve the formatting.
There's a function to copy sheets in openpyxl, but it is only from/to the same file. There's also a function insert_rows somewhere, but by itself it won't insert any rows. So I'm afraid we are left to deal (tediously) with one cell at a time.
As much as I dislike using for loops and would rather use something compact and elegant like list comprehension, I don't see how to do that here as this is a side-effect show.
Credit to this answer on copying between workbooks.
#!/usr/bin/env python3
#USAGE
#mergeXLSX.py <a bunch of .xlsx files> ... output.xlsx
#
#where output.xlsx is the unified file
#This works FROM/TO the xlsx format. Libreoffice might help to convert from xls.
#localc --headless --convert-to xlsx somefile.xls
import sys
from copy import copy
from openpyxl import load_workbook,Workbook
def createNewWorkbook(manyWb):
for wb in manyWb:
for sheetName in wb.sheetnames:
o = theOne.create_sheet(sheetName)
safeTitle = o.title
copySheet(wb[sheetName],theOne[safeTitle])
def copySheet(sourceSheet,newSheet):
for row in sourceSheet.rows:
for cell in row:
newCell = newSheet.cell(row=cell.row, column=cell.col_idx,
value= cell.value)
if cell.has_style:
newCell.font = copy(cell.font)
newCell.border = copy(cell.border)
newCell.fill = copy(cell.fill)
newCell.number_format = copy(cell.number_format)
newCell.protection = copy(cell.protection)
newCell.alignment = copy(cell.alignment)
filesInput = sys.argv[1:]
theOneFile = filesInput.pop(-1)
myfriends = [ load_workbook(f) for f in filesInput ]
#try this if you are bored
#myfriends = [ openpyxl.load_workbook(f) for k in range(200) for f in filesInput ]
theOne = Workbook()
del theOne['Sheet'] #We want our new book to be empty. Thanks.
createNewWorkbook(myfriends)
theOne.save(theOneFile)
Tested with openpyxl 2.5.4, python 3.4.
You can simply use pandas and os library to do this.
import pandas as pd
import os
#create an empty dataframe which will have all the combined data
mergedData = pd.DataFrame()
for files in os.listdir():
#make sure you are only reading excel files
if files.endswith('.xlsx'):
data = pd.read_excel(files, index_col=None)
mergedData = mergedData.append(data)
#move the files to other folder so that it does not process multiple times
os.rename(files, 'path to some other folder')
mergedData DF will have all the combined data which you can export in a separate excel or csv file. Same code will work with csv files as well. just replace it in the IF condition
Just to add to p_barill's answer, if you have custom column widths that you need to copy, you can add the following to the bottom of copySheet:
for col in sourceSheet.column_dimensions:
newSheet.column_dimensions[col] = sourceSheet.column_dimensions[col]
I would just post this in a comment on his or her answer but my reputation isn't high enough.

Categories