Import, rename, and merge Excel sheets in Python using a loop

I'm trying to import an Excel document into Python containing data spread across 100 sheets; I also have to repeat this process with dozens of Excel files. My goal is to merge every other sheet into a different dataframe based on the variable date, which is present in each sheet, e.g. dfx = pd.merge(df1, df3, on="date").
Each sheet, however, has an arbitrary/random name, so I'm trying to rename all the sheets in order 1 - 100, and then merge sheets 1 and 3, 2 and 4, etc.
I'm green at Python loops and programming generally, and I'm wondering how this can be accomplished. Below is my first attempt at the code, which returned the error TypeError: 'int' object is not iterable. I haven't been able to figure out the merge part of the loop, either.
Any help is much appreciated, thank you!
import pandas as pd
import os

path = 'C:\\Python Structures'
xl = 'sample.xlsx'
xl = os.path.join(path, xl)
dfxl = pd.read_excel(xl)
xl_file = pd.ExcelFile(xl)
num_sheets = len(xl_file.sheet_names)
for i in num_sheets:
    df.i = xl_file.parse(i)
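For what it's worth, a minimal sketch of one way to do this, assuming every sheet shares a "date" column and reading the pairing as "sheet i merged with sheet i + 2" (the pairing rule is a guess from the question and easy to adjust):

```python
import pandas as pd

def load_sheets_in_order(path):
    # Parse every sheet by position, so the arbitrary sheet names never matter.
    xl_file = pd.ExcelFile(path)
    return [xl_file.parse(i) for i in range(len(xl_file.sheet_names))]

def merge_sheet_pairs(sheets):
    # One reading of "merge sheets 1 and 3, 2 and 4, etc.":
    # merge sheet i with sheet i + 2 on the shared "date" column.
    return [pd.merge(sheets[i], sheets[i + 2], on="date")
            for i in range(len(sheets) - 2)]
```

Since `merge_sheet_pairs` works on any list of DataFrames, the same two functions can be reused across all the Excel files in a directory.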

Related

Format and manipulate data across multiple Excel sheets in Python using openpyxl before converting to Dataframe

I need some help with editing the sheets within my Excel workbook in Python before I stack the data using pd.concat(). Each sheet (~100) within my Excel workbook is structured identically, with the unique identifier for each sheet being a 6-digit code found in row 1 of the worksheet.
I've already done the following steps to import the file, unmerge rows 1-4, and insert a new column 'C':
import openpyxl
import pandas as pd

wb = openpyxl.load_workbook('data_sheets.xlsx')
for sheet in wb.worksheets:
    for merge in list(sheet.merged_cells):
        sheet.unmerge_cells(range_string=str(merge))
    sheet.insert_cols(3, 1)
    print(sheet)
wb.save('workbook_test.xlsx')

# concat once worksheets have been edited
df = pd.concat(pd.read_excel('workbook_test.xlsx', sheet_name=None), ignore_index=True)
Before stacking the data, however, I would like to make the following additional (sequential) changes to every sheet:
Extract from row 1 the rightmost 8 characters (in Excel the equivalent would be =RIGHT(A1, 8)); this pulls the unique code off each sheet, which will look like '(000000)'.
Populate column C from rows 6-282 with the unique code.
Delete rows 1-5.
The end result would make each sheet within the workbook look like this:
Is this possible to do with openpyxl, and if so, how? Any direction or assistance with this would be much appreciated - thank you!
Here is a 100% openpyxl approach to achieve what you're looking for:
from openpyxl import load_workbook

wb = load_workbook("workbook_test.xlsx")
for ws in wb:
    ws.unmerge_cells("A1:O1")  # unmerge first row up to column O
    ws_uid = ws.cell(row=1, column=1).value[-8:]  # get the sheet's UID
    for num_row in range(6, 283):  # rows 6 through 282 inclusive
        ws.cell(row=num_row, column=3).value = '="{}"'.format(ws_uid)  # write UID in column C
    ws.delete_rows(1, 5)  # delete first 5 rows
wb.save("workbook_test.xlsx")
NB: This assumes there is already an empty column (C).

Convert an Excel file with many sheets (with spaces in the sheet names) into a pandas DataFrame

I would like to convert an Excel file to a pandas DataFrame. All the sheet names have spaces in them, for instance 'part 1 of 22', 'part 2 of 22', and so on. In addition, the first column is the same in all the sheets.
I would like to convert this Excel file into a single DataFrame. However, I don't know what happens to the names in Python: I was able to import the sheets, but I do not know the names of the resulting DataFrames.
After this I would like to use another for loop with pd.merge() in order to create a single DataFrame.
for sheet_name in Matrix.sheet_names:
    sheet_name = pd.read_excel(Matrix, sheet_name)
    print(sheet_name.info())
Using only the code snippet you have shown, each sheet (each DataFrame) will be assigned to the variable sheet_name. Thus, this variable is overwritten on each iteration and you will only have the last sheet as a DataFrame assigned to that variable.
To achieve what you want to do you have to store each sheet, loaded as a DataFrame, somewhere, a list for example. You can then merge or concatenate them, depending on your needs.
Try this:
all_my_sheets = []
for sheet_name in Matrix.sheet_names:
    sheet_name = pd.read_excel(Matrix, sheet_name)
    all_my_sheets.append(sheet_name)
Or, even better, using list comprehension:
all_my_sheets = [pd.read_excel(Matrix, sheet_name) for sheet_name in Matrix.sheet_names]
You can then concatenate them into one DataFrame like this:
final_df = pd.concat(all_my_sheets, sort=False)
You might consider using the openpyxl package:
from openpyxl import load_workbook
import pandas as pd

wb = load_workbook(filename=file_path, read_only=True)
all_my_sheets = wb.sheetnames

# Assuming your sheets have the same headers and footers
header = None
records = []
for sheet_name in all_my_sheets:
    ws = wb[sheet_name]
    for i, row in enumerate(ws.iter_rows(min_col=1, min_row=1,
                                         max_col=ws.max_column,
                                         max_row=ws.max_row)):
        rec = [cell.value for cell in row]
        if i == 0:
            # Make sure you don't duplicate the header
            if header is None:
                header = rec
            continue
        records.append(rec)

# Create your df with the column names from the shared header
df = pd.DataFrame(records, columns=header)
It may be easiest to call read_excel() once with a list of sheet names, which returns a dictionary of DataFrames keyed by sheet name.
So, the first step would look like this:
dfs = pd.read_excel('file.xlsx', sheet_name=["Sheet 1", "Sheet 2", "Sheet 3"])
Note that the sheet names you use in the list should be the same as those in the Excel file. Then, if you wanted to vertically concatenate these sheets, you would just call:
final_df = pd.concat(dfs)
Note that this solution would result in a final_df that includes column headers from all three sheets. So, ideally they would be the same. It sounds like you want to merge the information, which would be done differently; we can't help you with the merge without more information.
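To sketch the concat-versus-merge distinction, merging the sheets horizontally on the shared first column might look like this (the column name "id" and the toy data are invented for the example):

```python
import pandas as pd
from functools import reduce

# Hypothetical sheets that all share a key column called "id".
s1 = pd.DataFrame({"id": [1, 2], "a": [10, 20]})
s2 = pd.DataFrame({"id": [1, 2], "b": [30, 40]})
s3 = pd.DataFrame({"id": [1, 2], "c": [50, 60]})

# Fold the list of sheets into one wide frame, merging on the key each time.
merged = reduce(lambda left, right: pd.merge(left, right, on="id"),
                [s1, s2, s3])
```

Unlike vertical concatenation, this produces one row per key and one column per sheet variable.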
I hope this helps!

Possible to loop through excel files with differently named sheets, and import into a list?

I do not have a reproducible example for this but asking it based on interest.
With a loop function in R, we are able to obtain all .csv from a directory with the below code:
file.list <- list.files(pattern='*.csv') #obtained name of all the files in directory
df.list <- lapply(file.list, read.csv) #list
Would it be possible to loop through a directory of .xlsx files instead, each with a different number of sheets?
For instance: A.xlsx contains 3 sheets, Jan01, Sheet2 and Sheet3; B.xlsx contains 3 sheets, Jan02, Sheet2 and Sheet3 ... and so on. The first sheet name changes.
Is it possible to loop through a directory and just obtain the dataframes for the first sheet in all excel files?
Python or R codes are welcome!
Thank you!
In R
Here is an R solution using the package openxlsx
# get all xlsx files in given directory
filesList <- list.files("d:/Test/", pattern = '.*\\.xlsx', full.names = TRUE)
# pre-allocate list of first sheet names
firstSheetList <- rep(list(NA),length(filesList))
# loop through files and get the data of first sheets
for (k in seq_along(filesList))
  firstSheetList[[k]] <- openxlsx::read.xlsx(filesList[k], sheet = 1)
another (fast) R-solution using the readxl-package
l <- lapply( file.list, readxl::read_excel, sheet = 1 )
Sure, it's possible using pandas and Python.
import pandas as pd
excel_file = pd.ExcelFile('A.xlsx')
dataframes = {sheet: excel_file.parse(sheet) for sheet in excel_file.sheet_names}
dataframes becomes a dictionary, with the keys being the names of the sheets, and the values becoming the dataframe containing the sheet data. You can iterate through them like so:
for k, v in dataframes.items():
    print('Sheetname: %s' % k)
    print(v.head())
By using openpyxl
get_sheet_names()
This function returns the names of the sheets in a workbook, and you can count the names to get the total number of sheets in the current workbook. The code will be:
>>> wb = openpyxl.load_workbook('testfile.xlsx')
>>> wb.get_sheet_names()
['S1', 'S2', 'S3']
We can access one sheet at a time. Let's suppose we want to access Sheet3. The following code should be written:
>>> import openpyxl
>>> wb=openpyxl.load_workbook('testfile.xlsx')
>>> wb.get_sheet_names()
['Sheet1', 'Sheet2', 'Sheet3']
>>> sheet=wb.get_sheet_by_name('Sheet3')
The function get_sheet_by_name('Sheet3') is used to access a particular sheet. It takes the name of the sheet as an argument and returns a sheet object, which we store in a variable and can use like so. (Note: in recent openpyxl versions, get_sheet_names() and get_sheet_by_name() are deprecated in favor of wb.sheetnames and wb['Sheet3'].)
>>> sheet
<Worksheet "Sheet3">
>>> type(sheet)
<class 'openpyxl.worksheet.worksheet.Worksheet'>
>>> sheet.title
'Sheet3'
>>>
and eventually:
worksheet = workbook.get_sheet_by_name('Sheet3')
for row_cells in worksheet.iter_rows():
    for cell in row_cells:
        print('%s: cell.value=%s' % (cell, cell.value))
For simplicity, let's say we had two workbooks with the first sheet in this format:
You can iterate over each .xlsx file in the directory with glob.glob(), and append the dataframe of the first sheet with pandas.ExcelFile.parse() to a list:
from glob import glob
import pandas as pd
sheets = []

# Go through each xlsx file
for xlsx_file in glob("*.xlsx"):
    # Open the workbook
    xlsx = pd.ExcelFile(xlsx_file)
    # Get the first sheet as a dataframe and append it
    sheet_1 = xlsx.parse(0)
    sheets.append(sheet_1)

print(sheets)
Which prints two dataframes contained in a list:
[ x y
0 1 2
1 1 2, x y
0 1 2
1 1 2]
You can also write the above as a list comprehension:
[pd.ExcelFile(xlsx_file).parse(0) for xlsx_file in glob("*.xlsx")]
You could also store the dataframes into a dictionary with filenames as the key:
{xlsx_file: pd.ExcelFile(xlsx_file).parse(0) for xlsx_file in glob("*.xlsx")}

Combining Excel sheets individually using pandas

I am writing a function in pandas that reads Excel files from a working directory. Each Excel file consists of multiple sheets; the corresponding sheets in each file have the same column names, and the number of sheets in each file is the same as well.
I would like a function that merges/appends each sheet across the different files, such that sheet1 from all the files is merged into one dataframe, sheet2 from all the files into a second dataframe, and so on. In the end, I would like to know the number of dataframes created.
For this purpose, I wrote the following code:
from os import walk
import pandas as pd

fpath = "/path to files/"
df = pd.DataFrame()
f = []
xls = []
dff = []
mypath = fpath
for (dirpath, dirnames, filenames) in walk(mypath):
    f.extend(filenames)
    break
for i in range(0, len(f)):
    f[i] = mypath + "/" + f[i]
    xls.append(pd.ExcelFile(f[i]))
cout = 0
for fil in range(0, len(xls)):
    for sh in range(0, len(xls)):
        if cout <= len(xls):
            df = df.append(pd.read_excel(xls[sh], fil))
    dff.append(df)
    cout = cout + 1
I introduced the cout variable so that after merging/appending sheet 1 from all the files, the loop would break; otherwise all the sheets get merged into a single dataframe.
Problem: the function stops after returning only one dataframe, in which the first sheets are merged. If I remove the cout switch, then all the sheets are merged. Can anyone help me fix the function so that it (1) merges/appends the corresponding sheets from each file, (2) makes a dataframe from each, and (3) returns the dataframes? That way I will have a dataframe for each of the merged/appended sheets.
Can anyone help, please?
Note: I am doing it in pandas but kindly suggest if you think there are better alternatives in R or any other programming language.
Ok, I looked through your code and I might have an answer for you without looping so much. Maybe it helps, maybe not.
As you point to one folder, let us use listdir instead. Use pd.ExcelFile once to get the sheet names, then loop through all the sheet names and pd.concat the different Excel files for each specific sheet_name.
import pandas as pd
import os
# Preparation
p = 'exceltest' #<-- folder name
files = [os.path.join(p,i) for i in os.listdir(p) if i.endswith('.xlsx')]
sheets = pd.ExcelFile(files[0]).sheet_names
# Dictionary holding the sheet_names as keys
dfs = {s: pd.concat(pd.read_excel(f, sheet_name=s) for f in files) for s in sheets}
# Only for demo purpose
print(dfs[sheets[0]])
In my example files (named Workbook1 and Workbook2), each with sheets named Sheet 1 and Sheet 2 containing a small matrix with columns A and B and two rows of values 1 and 2, this prints:
A B
0 1 2
0 1 2

How to increase process speed using read_excel in pandas?

I need to use pd.read_excel to process every sheet in one Excel file.
But in most cases I do not know the sheet names.
So I use this to judge how many sheets are in the Excel file:
i_sheet_count = 0
i = 0
while True:
    try:
        df = pd.read_excel('/tmp/1.xlsx', sheetname=i)
        i_sheet_count += 1
        i += 1
    except Exception:
        break
print(i_sheet_count)
During the process, I found that it is quite slow.
So, can read_excel read only a limited number of rows to improve the speed?
I tried nrows but it did not work; it is still slow.
Read all worksheets without guessing
Use the sheet_name=None argument to pd.read_excel. This will read all worksheets into a dictionary of DataFrames. For example:
dfs = pd.read_excel('file.xlsx', sheet_name=None)
# access 'Sheet1' worksheet
res = dfs['Sheet1']
Limit number of rows or columns
You can use the usecols and skipfooter arguments to limit the number of columns and/or rows. This will reduce read time, and also works with sheet_name=None.
For example, the following will read the first 3 columns and, if your worksheet has 100 rows, only the first 20:
df = pd.read_excel('file.xlsx', sheet_name=None, usecols='A:C', skipfooter=80)
If you wish to apply worksheet-specific logic, you can do so by extracting sheet_names:
sheet_names = pd.ExcelFile('file.xlsx').sheet_names
dfs = {}
for sheet in sheet_names:
    dfs[sheet] = pd.read_excel('file.xlsx', sheet)
Improving performance
Reading Excel files into Pandas is naturally slower than other options (CSV, Pickle, HDF5). If you wish to improve performance, I strongly suggest you consider these other formats.
One option, for example, is to use a VBA script to convert your Excel worksheets to CSV files; then use pd.read_csv.
Edit 02 Nov: correct sheetname to sheet_name
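The one-time conversion to CSV can also be done from Python itself rather than VBA; a rough sketch (the helper name excel_to_csvs is invented, and it assumes sheet names are valid filenames):

```python
import os
import pandas as pd

def excel_to_csvs(xlsx_path, out_dir):
    # One-time cost: read every sheet once and dump each to its own CSV.
    # Subsequent runs can then use the much faster pd.read_csv.
    os.makedirs(out_dir, exist_ok=True)
    for name, df in pd.read_excel(xlsx_path, sheet_name=None).items():
        df.to_csv(os.path.join(out_dir, name + ".csv"), index=False)
```

After running this once, the slow pd.read_excel call drops out of the hot path entirely.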
I had an Excel file with many sheets, and I wanted only those sheets whose state is visible. If you are not familiar with sheet states, that's fine. If you want to read the sheet names from Excel, you can use this code; it took me on average about 3 seconds to read roughly 20 sheet names. It took quite a few attempts to get this.
file_name = r'C:\Users\xyz.xlsx'
File_sheet_list = []
workbookObj = pd.ExcelFile(file_name)
LenOfWorkBook = len(workbookObj.book.worksheets)
for idx in range(0, LenOfWorkBook):
    if workbookObj.book.worksheets[idx].sheet_state == "visible":
        File_sheet_list.append(workbookObj.book.worksheets[idx].title)
