I am writing a function in pandas that can read Excel files from a working directory. Each of the Excel files consists of multiple sheets; the corresponding sheets in each file have the same column names, and the number of sheets in each file is the same as well.
I would like to have a function that can merge/append the corresponding sheets from the different files, such that sheet1 from all the files is merged into one dataframe, sheet2 from all the files is merged into a second dataframe, and so on. In the end, I would like to know the number of dataframes created.
For this purpose, I wrote the following code:
import pandas as pd
from os import walk

fpath = "/path to files/"
df = pd.DataFrame()
f = []
xls = []
dff = []
mypath = fpath
for (dirpath, dirnames, filenames) in walk(mypath):
    f.extend(filenames)
    break
for i in range(0, len(f)):
    f[i] = mypath + "/" + f[i]
    xls.append(pd.ExcelFile(f[i]))
cout = 0
for fil in range(0, len(xls)):
    for sh in range(0, len(xls)):
        if cout <= len(xls):
            df = df.append(pd.read_excel(xls[sh], fil))
            dff.append(df)
            cout = cout + 1
I introduced the cout variable so that after merging/appending sheet 1 from all the files the loop stops; otherwise all the sheets are merged into a single dataframe.
Problem: The function stops after returning only one dataframe, in which the first sheets are merged. If I remove the cout switch, then all the sheets are merged into one dataframe. Can anyone help me fix the function so that it 1) merges/appends the corresponding sheets from each file, 2) makes a dataframe from each result of (1), and 3) returns those dataframes? That way I will have one dataframe for each merged/appended sheet.
Can anyone help, please?
Note: I am doing this in pandas, but kindly suggest if you think there are better alternatives in R or any other programming language.
OK, I looked through your code and I might have an answer for you without so much looping. Maybe it helps, maybe not.
Since you point to one folder, let's use listdir instead. Use pd.ExcelFile once to get the sheet names, then loop through all the sheet names and pd.concat the different Excel files for each specific sheet_name.
import pandas as pd
import os
# Preparation
p = 'exceltest' #<-- folder name
files = [os.path.join(p,i) for i in os.listdir(p) if i.endswith('.xlsx')]
sheets = pd.ExcelFile(files[0]).sheet_names
# Dictionary holding the sheet_names as keys
dfs = {s: pd.concat(pd.read_excel(f, sheet_name=s) for f in files) for s in sheets}
# Only for demo purpose
print(dfs[sheets[0]])
In my example files (named Workbook1 and Workbook2), each with sheets (Sheet 1, Sheet 2) containing columns A, B and a single row 1, 2, this prints:
A B
0 1 2
0 1 2
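The original question also asks for the number of dataframes created; since the dict above holds one merged dataframe per sheet name, that count is simply its length. A minimal sketch (the dict contents here are hypothetical stand-ins for the real dfs):

```python
import pandas as pd

# Hypothetical stand-in for dfs: one merged dataframe per sheet name.
dfs = {
    "Sheet 1": pd.DataFrame({"A": [1, 1], "B": [2, 2]}),
    "Sheet 2": pd.DataFrame({"A": [3, 3], "B": [4, 4]}),
}

# One dataframe per sheet, so the number of dataframes is the dict's length.
num_dataframes = len(dfs)
print(num_dataframes)  # → 2
```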
Related
Most of the articles I'm seeing either:
a) Combine multiple single-sheet Excel workbooks into one master workbook with just a single sheet, or
b) Split a multiple-sheet excel workbook into individual workbooks.
However, my goal is to grab all the excel files in a specific folder and save them as individual sheets within one new master excel workbook. I'm trying to rename each sheet name as the name of the original file.
import pandas as pd
import glob
import os
file = "C:\\File\\Path\\"
filename = 'Consolidated Files.xlsx'
pth = os.path.dirname(file)
extension = os.path.splitext(file)[1]
files = glob.glob(os.path.join(pth, '*xlsx'))
w = pd.ExcelWriter(file + filename)
for f in files:
    print(f)
    df = pd.read_excel(f, header=None)
    print(df)
    df.to_excel(w, sheet_name=f, index=False)
w.save()
How do I adjust the names for each sheet? Also, if you see any opportunities to clean this up, please let me know.
You cannot name a sheet with special characters, and f is the full path and file name. You should use only the file name for the sheet name. Use os.path.basename to get the file name, and use split to separate the file name from the extension.
for f in files:
    print(f)
    df = pd.read_excel(f, header=None)
    print(df)
    # Use basename to get the filename with extension,
    # then split to separate the filename from the extension
    new_sheet_name = os.path.basename(f).split('.')[0]
    df.to_excel(w, sheet_name=new_sheet_name, index=False)
I decided to put my solution here as well, just in case it would be useful to anyone.
Thing is, I wanted to be able to recall where the end sheet came from. However, source workbooks can (and likely will) often have the same sheet names, like "Sheet 1", so I couldn't just use the sheet names from the original workbooks. I also could not use the source filenames as sheet names, since they might be longer than 31 characters, which is the maximum sheet name length allowed by Excel.
Therefore, I ended up assigning incremental numbers to resulting sheet names, while simultaneously inserting a new column named "source" at the start of each sheet and populating it with file name concatenated with sheet name. Hope it might help someone :)
from glob import glob
import pandas as pd
import os
files_input = glob(r'C:\Path\to\folder\*.xlsx')
result_DFs = []
for xlsx_file in files_input:
    file_DFs = pd.read_excel(xlsx_file, sheet_name=None)

    # save every sheet from every file as a dataframe to a list
    for sheet_DF in file_DFs:
        source_name = os.path.basename(xlsx_file) + ":" + sheet_DF
        file_DFs[sheet_DF].insert(0, 'source', source_name)
        result_DFs.append(file_DFs[sheet_DF])

# set_column below requires the xlsxwriter engine
with pd.ExcelWriter(r'C:\Path\to\resulting\file.xlsx', engine='xlsxwriter') as writer:
    for df_index in range(len(result_DFs)):
        # write dataframe to file using a simple incremental number as the new sheet name
        result_DFs[df_index].to_excel(writer, sheet_name=str(df_index), index=False)

        # auto-adjust column width (can be omitted if not needed)
        for i, col in enumerate(result_DFs[df_index].columns):
            column_len = max(result_DFs[df_index][col].astype(str).str.len().max(), len(col) + 3)
            _ = writer.sheets[str(df_index)].set_column(i, i, column_len)
I am trying to make a list using pandas before putting all data sets into 2D convolution layers.
And I was able to merge all data in the multiple excel files as a list.
However, the code only reads one chosen sheet name in the multiple excel files.
For example, I have 7 sheets in each excel file; named as 'gpascore1', 'gpascore2', 'gpascore3', 'gpascore4', 'gpascore5', 'gpascore6', 'gpascore7'.
And each sheet has 4 rows and 425 columns.
The code is shown below.
import os
import pandas as pd

path = os.getcwd()
files = os.listdir(path)
files_xls = [f for f in files if f[-3:] == 'xls']

df = pd.DataFrame()
for f in files_xls:
    # Reads only one chosen sheet -- 'gpascore1' is a sheet name, but there are
    # 6 more sheets and I would like to read data from all of them
    data = pd.read_excel(f, 'gpascore1')
    df = df.append(data)

data_y = df['admit'].values
data_x = []
for i, rows in df.iterrows():
    data_x.append([rows['gre'], rows['gpa'], rows['rank']])

df = df.dropna()
df.count()
Then, I got the result below.
This is because the data from the 'gpascore1' sheet in the 3 excel files were merged.
But I want to read the data from the 6 other sheets in the excel files as well.
Could anyone help me find the answer, please?
Thank you
===============<Updated code & errors>==================================
Thank you for the answers. I revised the read_excel() call from
data = pd.read_excel(f, 'gpascore1')
to
data = pd.read_excel(f, sheet_name=None)
But I get key errors like the ones below.
Could you give me any suggestions for this issue, please?
Thank you
I actually found this question under the tag of 'tensorflow'. That's hilarious. OK, so you want to merge all Excel sheets into one dataframe?
import os
import pandas as pd
import glob
glob.glob("C:\\your_path\\*.xlsx")
all_data = pd.DataFrame()
for f in glob.glob("C:\\your_path\\*.xlsx"):
    df = pd.read_excel(f)
    all_data = all_data.append(df, ignore_index=True)
type(all_data)
I do not have a reproducible example for this, but am asking based on interest.
With a loop in R, we are able to obtain all .csv files from a directory with the code below:
file.list <- list.files(pattern='*.csv') #obtained name of all the files in directory
df.list <- lapply(file.list, read.csv) #list
Would it be possible to loop through a directory of .xlsx files instead, each with a different number of sheets?
For instance: A.xlsx contains 3 sheets, Jan01, Sheet2 and Sheet3; B.xlsx contains 3 sheets, Jan02, Sheet2 and Sheet3 ... and so on. The first sheet name changes.
Is it possible to loop through a directory and just obtain the dataframes for the first sheet in all excel files?
Python or R codes are welcome!
Thank you!
In R
Here is an R solution using the package openxlsx
# get all xlsx files in given directory
filesList <- list.files("d:/Test/", pattern = '.*\\.xlsx', full.names = TRUE)
# pre-allocate list of first sheet names
firstSheetList <- rep(list(NA),length(filesList))
# loop through files and get the data of first sheets
for (k in seq_along(filesList))
  firstSheetList[[k]] <- openxlsx::read.xlsx(filesList[k], sheet = 1)
Another (fast) R solution, using the readxl package:
l <- lapply( file.list, readxl::read_excel, sheet = 1 )
Sure, it's possible using pandas and Python.
import pandas as pd
excel_file = pd.ExcelFile('A.xlsx')
dataframes = {sheet: excel_file.parse(sheet) for sheet in excel_file.sheet_names}
dataframes becomes a dictionary, with the keys being the names of the sheets, and the values becoming the dataframe containing the sheet data. You can iterate through them like so:
for k, v in dataframes.items():
    print('Sheetname: %s' % k)
    print(v.head())
By using openpyxl's get_sheet_names():
This function returns the names of the sheets in a workbook, and you can count the names to get the total number of sheets in the current workbook. The code will be:
>>> wb=openpyxl.load_workbook('testfile.xlsx')
>>> wb.get_sheet_names()
['S1', 'S2', 'S3']
We can access one sheet at a time. Let's suppose we want to access Sheet3; the following code should be written:
>>> import openpyxl
>>> wb=openpyxl.load_workbook('testfile.xlsx')
>>> wb.get_sheet_names()
['Sheet1', 'Sheet2', 'Sheet3']
>>> sheet=wb.get_sheet_by_name('Sheet3')
The function get_sheet_by_name('Sheet3') is used to access a particular sheet. It takes the name of the sheet as an argument and returns a sheet object. We store that in a variable and can use it like...
>>> sheet
<Worksheet "Sheet3">
>>> type(sheet)
<class 'openpyxl.worksheet.worksheet.Worksheet'>
>>> sheet.title
'Sheet3'
>>>
and eventually:
worksheet = workbook.get_sheet_by_name('Sheet3')
for row_cells in worksheet.iter_rows():
    for cell in row_cells:
        print('%s: cell.value=%s' % (cell, cell.value))
For simplicity, let's say we had two workbooks with the first sheet in this format:
You can iterate over each .xlsx file in the directory with glob.glob(), and append the dataframe of the first sheet with pandas.ExcelFile.parse() to a list:
from glob import glob
import pandas as pd
sheets = []
# Go through each xlsx file
for xlsx_file in glob("*.xlsx"):
    # Convert sheet to dataframe
    xlsx = pd.ExcelFile(xlsx_file)
    # Get first sheet and append it
    sheet_1 = xlsx.parse(0)
    sheets.append(sheet_1)
print(sheets)
Which prints two dataframes contained in a list:
[ x y
0 1 2
1 1 2, x y
0 1 2
1 1 2]
You can also write the above as a list comprehension:
[pd.ExcelFile(xlsx_file).parse(0) for xlsx_file in glob("*.xlsx")]
You could also store the dataframes into a dictionary with filenames as the key:
{xlsx_file: pd.ExcelFile(xlsx_file).parse(0) for xlsx_file in glob("*.xlsx")}
I'm trying to import an Excel document into Python containing data spread across 100 sheets - I also have to repeat this process with dozens of Excel files. My goal is to merge every other sheet into a different dataframe based on the variable date, which is present in each sheet, e.g. dfx = pd.merge(df1, df3, on="date").
Each sheet, however, has an arbitrary/random name, so I'm trying to rename all the sheets in order 1 - 100 and then merge sheets 1 and 3, 2 and 4, etc.
I'm green at Python loops and programming generally, and am wondering how this can be accomplished. Below is my first attempt at the code, which returned the error TypeError: 'int' object is not iterable. I haven't been able to figure out the merge element of the loop, either.
Any help is much appreciated, thank you!
import os
import pandas as pd
import xlrd

path = 'C:\\Python Structures'
xl = 'sample.xlsx'
xl = os.path.join(path, xl)
dfxl = pd.read_excel(xl)
xl_file = pd.ExcelFile(xl)
num_sheets = len(xl_file.sheet_names)
for i in num_sheets:
    df.i = xl_file.parse(i)
Hello, I would like to concatenate three xlsx Excel files using Python.
I have tried using openpyxl, but I don't know which function could help me append three worksheets into one.
Do you have any ideas how to do that ?
Thanks a lot
Here's a pandas-based approach. (It's using openpyxl behind the scenes.)
import pandas as pd
# filenames
excel_names = ["xlsx1.xlsx", "xlsx2.xlsx", "xlsx3.xlsx"]
# read them in
excels = [pd.ExcelFile(name) for name in excel_names]
# turn them into dataframes
frames = [x.parse(x.sheet_names[0], header=None, index_col=None) for x in excels]
# delete the first row for all frames except the first
# i.e. remove the header row -- assumes it's the first
frames[1:] = [df[1:] for df in frames[1:]]
# concatenate them..
combined = pd.concat(frames)
# write it out
combined.to_excel("c.xlsx", header=False, index=False)
I'd use xlrd and xlwt. Assuming you literally just need to append these files (rather than doing any real work on them), I'd do something like: Open up a file to write to with xlwt, and then for each of your other three files, loop over the data and add each row to the output file. To get you started:
import xlwt
import xlrd
wkbk = xlwt.Workbook()
outsheet = wkbk.add_sheet('Sheet1')
xlsfiles = [r'C:\foo.xlsx', r'C:\bar.xlsx', r'C:\baz.xlsx']
outrow_idx = 0
for f in xlsfiles:
    # This is all untested; essentially just pseudocode for concept!
    insheet = xlrd.open_workbook(f).sheets()[0]
    for row_idx in xrange(insheet.nrows):
        for col_idx in xrange(insheet.ncols):
            outsheet.write(outrow_idx, col_idx,
                           insheet.cell_value(row_idx, col_idx))
        outrow_idx += 1
wkbk.save(r'C:\combined.xls')
If your files all have a header line, you probably don't want to repeat that, so you could modify the code above to look more like this:
firstfile = True  # Is this the first sheet?
for f in xlsfiles:
    insheet = xlrd.open_workbook(f).sheets()[0]
    for row_idx in xrange(0 if firstfile else 1, insheet.nrows):
        pass  # processing; etc
    firstfile = False  # We're done with the first sheet.
When I combine excel files (mydata1.xlsx, mydata2.xlsx, mydata3.xlsx) for data analysis, here is what I do:
import pandas as pd
import numpy as np
import glob
all_data = pd.DataFrame()
for f in glob.glob('myfolder/mydata*.xlsx'):
    df = pd.read_excel(f)
    all_data = all_data.append(df, ignore_index=True)
Then, when I want to save it as one file:
writer = pd.ExcelWriter('mycollected_data.xlsx', engine='xlsxwriter')
all_data.to_excel(writer, sheet_name='Sheet1')
writer.save()
Solution with openpyxl only (without a bunch of other dependencies).
This script should take care of merging together an arbitrary number of xlsx documents, whether they have one or multiple sheets. It will preserve the formatting.
There's a function to copy sheets in openpyxl, but it is only from/to the same file. There's also a function insert_rows somewhere, but by itself it won't insert any rows. So I'm afraid we are left to deal (tediously) with one cell at a time.
As much as I dislike using for loops and would rather use something compact and elegant like list comprehension, I don't see how to do that here as this is a side-effect show.
Credit to this answer on copying between workbooks.
#!/usr/bin/env python3
#USAGE
#mergeXLSX.py <a bunch of .xlsx files> ... output.xlsx
#
#where output.xlsx is the unified file
#This works FROM/TO the xlsx format. Libreoffice might help to convert from xls.
#localc --headless --convert-to xlsx somefile.xls
import sys
from copy import copy
from openpyxl import load_workbook,Workbook
def createNewWorkbook(manyWb):
    for wb in manyWb:
        for sheetName in wb.sheetnames:
            o = theOne.create_sheet(sheetName)
            safeTitle = o.title
            copySheet(wb[sheetName], theOne[safeTitle])

def copySheet(sourceSheet, newSheet):
    for row in sourceSheet.rows:
        for cell in row:
            newCell = newSheet.cell(row=cell.row, column=cell.col_idx,
                                    value=cell.value)
            if cell.has_style:
                newCell.font = copy(cell.font)
                newCell.border = copy(cell.border)
                newCell.fill = copy(cell.fill)
                newCell.number_format = copy(cell.number_format)
                newCell.protection = copy(cell.protection)
                newCell.alignment = copy(cell.alignment)
filesInput = sys.argv[1:]
theOneFile = filesInput.pop(-1)
myfriends = [ load_workbook(f) for f in filesInput ]
#try this if you are bored
#myfriends = [ openpyxl.load_workbook(f) for k in range(200) for f in filesInput ]
theOne = Workbook()
del theOne['Sheet'] #We want our new book to be empty. Thanks.
createNewWorkbook(myfriends)
theOne.save(theOneFile)
Tested with openpyxl 2.5.4, python 3.4.
You can simply use pandas and os library to do this.
import pandas as pd
import os
# create an empty dataframe which will hold all the combined data
mergedData = pd.DataFrame()

for files in os.listdir():
    # make sure you are only reading excel files
    if files.endswith('.xlsx'):
        data = pd.read_excel(files, index_col=None)
        mergedData = mergedData.append(data)
        # move the file to another folder so that it does not get processed multiple times
        os.rename(files, 'path to some other folder')
The mergedData DataFrame will have all the combined data, which you can export to a separate Excel or CSV file. The same code works with CSV files as well; just change the extension in the if condition.
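As a sketch of that CSV variant — the same loop, with the extension check and reader swapped (the folder is assumed to be the current working directory, and the file-moving step is left out):

```python
import os

import pandas as pd

mergedData = pd.DataFrame()
for name in os.listdir():
    # only read CSV files this time
    if name.endswith('.csv'):
        data = pd.read_csv(name, index_col=None)
        # pd.concat works the same way DataFrame.append did
        mergedData = pd.concat([mergedData, data], ignore_index=True)
```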
Just to add to p_barill's answer, if you have custom column widths that you need to copy, you can add the following to the bottom of copySheet:
for col in sourceSheet.column_dimensions:
    newSheet.column_dimensions[col] = sourceSheet.column_dimensions[col]
I would just post this in a comment on his or her answer, but my reputation isn't high enough.