I need to use pd.read_excel to process every sheet in one Excel file.
But in most cases I do not know the sheet names.
So I use this to work out how many sheets the Excel file has:
i_sheet_count = 0
i = 0
while True:
    try:
        pd.read_excel('/tmp/1.xlsx', sheet_name=i)
    except Exception:
        break
    i_sheet_count += 1
    i += 1
print(i_sheet_count)
During the process I found that it is quite slow.
So, can read_excel read only a limited number of rows to improve the speed?
I tried nrows but it did not work; it is still slow.
Read all worksheets without guessing
Use the sheet_name=None argument to pd.read_excel. This will read all worksheets into a dictionary of DataFrames. For example:
dfs = pd.read_excel('file.xlsx', sheet_name=None)
# access 'Sheet1' worksheet
res = dfs['Sheet1']
Limit number of rows or columns
You can use the usecols and skipfooter arguments (named parse_cols and skip_footer in older pandas versions) to limit the number of columns and/or rows. This will reduce read time, and also works with sheet_name=None.
For example, the following will read the first 3 columns and, if your worksheet has 100 rows, only the first 20:
df = pd.read_excel('file.xlsx', sheet_name=None, usecols='A:C', skipfooter=80)
If you wish to apply worksheet-specific logic, you can do so by extracting the sheet names first:
sheet_names = pd.ExcelFile('file.xlsx').sheet_names
dfs = {}
for sheet in sheet_names:
    dfs[sheet] = pd.read_excel('file.xlsx', sheet_name=sheet)
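If you are reading many sheets this way, you can reuse a single pd.ExcelFile object so the workbook is opened only once; a minimal sketch of the same loop:
xl = pd.ExcelFile('file.xlsx')
dfs = {sheet: xl.parse(sheet) for sheet in xl.sheet_names}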
Improving performance
Reading Excel files into Pandas is naturally slower than other options (CSV, Pickle, HDF5). If you wish to improve performance, I strongly suggest you consider these other formats.
One option, for example, is to use a VBA script to convert your Excel worksheets to CSV files; then use pd.read_csv.
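If you would rather stay in Python than write VBA, a one-time conversion with pandas itself achieves the same thing; a sketch, assuming the file.xlsx from above and sheet names that are valid file names:
import pandas as pd

# One-time conversion: dump each worksheet to its own CSV file
for name, df in pd.read_excel('file.xlsx', sheet_name=None).items():
    df.to_csv(f'{name}.csv', index=False)

# Subsequent runs can then use the much faster pd.read_csv
res = pd.read_csv('Sheet1.csv')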
I had an Excel file with many sheets and wanted only those sheets whose state is visible (if you have never come across hidden sheets, that's fine). If you want to read the sheet names from an Excel file, you can use this code; it took about 3 seconds on average to read roughly 20 sheet names. It took me quite a few attempts to get this right.
file_name = r'C:\Users\xyz.xlsx'
File_sheet_list = []
workbookObj = pd.ExcelFile(file_name)
# .book exposes the underlying openpyxl workbook (for .xlsx files)
for worksheet in workbookObj.book.worksheets:
    if worksheet.sheet_state == "visible":
        File_sheet_list.append(worksheet.title)
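If you prefer to skip pandas entirely for this step, the same check can be done with openpyxl directly; a minimal sketch, assuming an .xlsx file:
from openpyxl import load_workbook

wb = load_workbook(file_name, read_only=True)
File_sheet_list = [ws.title for ws in wb.worksheets if ws.sheet_state == "visible"]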
I'm using this line of code to get all sheets from an Excel file:
excel_file = pd.read_excel('path_file', skiprows=35, sheet_name=None)
The sheet_name=None option gets all the sheets.
How do I get all sheets except one of them?
If all you want to do is exclude one of the sheets, there is not much to change from your base code.
Assume file.xlsx is an excel file with multiple sheets, and you want to skip 'Sheet1'.
One possible solution is as follows:
import pandas as pd

# Returns a dictionary with key:value := sheet_name:df
xlwb = pd.read_excel('file.xlsx', sheet_name=None)
unwanted_sheet = 'Sheet1'

# generator expression that filters out the unwanted sheet;
# all other sheets are kept in df_generator
df_generator = (df for name, df in xlwb.items()
                if name != unwanted_sheet)

# get to the actual dataframes
for df in df_generator:
    print(df.head())
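Alternatively, you can filter the sheet names up front and pass only the ones you want, so the unwanted sheet is never parsed at all:
wanted = [s for s in pd.ExcelFile('file.xlsx').sheet_names if s != 'Sheet1']
xlwb = pd.read_excel('file.xlsx', sheet_name=wanted)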
I have an Excel file with 13 tabs, and I want to write a function that takes specified sheets from the file, converts them into separate DataFrames, then bundles them into a list of DataFrames. In this case, I want to take the sheets labeled 'tblProviderDetails', 'tblSubmissionStatus', and 'Data Validation Ref Data', convert them into DataFrames, and make a list. The reason I want the dfs in a list is that I eventually want to take the input dfs and return a dictionary, which will then be used to create a YAML file.
This is ultimately what I want:
dfs = [ 'tblProviderDetails', 'tblSubmissionStatus', 'Data Validation Ref Data']
The reason that I want to use a user-defined function is that I want the flexibility to call any sheet and any number of sheets into a list.
I was able to write a function that converts single specified sheets to dataframes, but I'm not sure how to call any number of sheets in the Excel file or create a list within the function. This is as far as I've gotten:
def read_excel(path, sheet_name, header):
    dfs = pd.read_excel(path, sheet_name=sheet_name, header=header)
    return dfs

df1 = read_excel(path=BASEDIR, sheet_name='tblProviderDetails', header=2)
df2 = read_excel(path=BASEDIR, sheet_name='tblSubmissionStatus', header=2)
df3 = read_excel(path=BASEDIR, sheet_name='Data Validation Ref Data', header=2)
Thank you for your help.
There are multiple ways to do this, but perhaps the simplest is to first get all the sheet names and then, in a loop, load each sheet into a DataFrame and append it to the required list.
dfList = []

def read_excel(path, h):
    xls = pd.ExcelFile(path)
    # Now you can access all sheet names in the file
    sheetsList = xls.sheet_names  # ['sheet1', 'sheet2', ...]
    for sheet in sheetsList:
        dfList.append(pd.read_excel(path, sheet_name=sheet, header=h))

read_excel('book.xlsx', 2)
print(dfList)
You can pass a list of sheet names and/or sheet numbers to the sheet_name parameter.
def read_excel(path, sheet_name, header):
    dfs = pd.read_excel(path, sheet_name=sheet_name, header=header)
    return dfs

sheet_names = ['tblProviderDetails', 'tblSubmissionStatus', 'Data Validation Ref Data']
dfs = read_excel(path=BASEDIR, sheet_name=sheet_names, header=2)
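Note that when you pass a list to sheet_name, pd.read_excel returns a dictionary of DataFrames keyed by sheet name, so to get the list of DataFrames you asked for:
dfs_list = list(dfs.values())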
I have an Excel workbook with more than 200 sheets of data. Sheet names are as shown in the figure. I would like to assign each sheet to an individual variable as a DataFrame and later extract some required data from each sheet. The extracted information from all the sheets needs to be stored in a single Excel sheet. As I cannot keep writing this 200 times, I would like to know if I can write a function or use a for loop to automate this process.
df1 = pd.read_excel("C:\\Users\\RECL\\Documents\\PRADYUMNA\\Experiment Data\\CNN\\CCCV Data.xlsx", sheet_name=5)
df2 = pd.read_excel("C:\\Users\\RECL\\Documents\\PRADYUMNA\\Experiment Data\\CNN\\CCCV Data.xlsx", sheet_name=10)
df3 = pd.read_excel("C:\\Users\\RECL\\Documents\\PRADYUMNA\\Experiment Data\\CNN\\CCCV Data.xlsx", sheet_name=15)
df1 = df1[0::100]
df2 = df2[0::200]
df3 = df3[0::300]
df1
for i in range(0, 1035, 5):
    df = pd.read_excel(xlsx, sheet_name=i)
df
I tried something like this but it isn't working. Please let me know if there is a simple way to do it.
Thank you :)
Not sure exactly what you are trying to do, but an easier way to traverse the sheet names is with a for-each loop over a pd.ExcelFile object (your xlsx):
xlsx = pd.ExcelFile("C:\\Users\\RECL\\Documents\\PRADYUMNA\\Experiment Data\\CNN\\CCCV Data.xlsx")
for sheet in xlsx.sheet_names:
    ...
Now you can do something for all the sheets no matter their name.
Regarding " would like to assign each sheet to an individual variable" you could use a dictionary:
sheets = {}
for sheet in input.sheet_names:
sheets[sheet] = pd.read_excel(xlsx, sheet)
Now to get a sheet from the dictionary sheets:
sheets.get("15")
Or to traverse all the sheets:
for sheet in sheets:
    # do something with each sheet's DataFrame, e.g. print it
    print(sheets[sheet])
This will print the data of each sheet in sheets.
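You also mentioned storing the extracted information from all sheets in a single Excel sheet. A minimal sketch of that last step, assuming the per-sheet extracts share the same columns ('combined.xlsx' is a placeholder name):
combined = pd.concat(sheets.values(), ignore_index=True)
combined.to_excel("combined.xlsx", sheet_name="all_data", index=False)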
Hope this helps / brings you further
I would like to convert an Excel file to a pandas DataFrame. All the sheet names have spaces in them, for instance ' part 1 of 22', ' part 2 of 22', and so on. In addition, the first column is the same in all the sheets.
I would like to convert this Excel file into a single DataFrame. However, I don't know what happens with the names in Python. I mean, I was able to import the sheets, but I do not know the names of the resulting DataFrames.
The sheets are imported, but I do not know their names. After this I would like to use another 'for' and pd.merge() in order to create a single DataFrame.
for sheet_name in Matrix.sheet_names:
    sheet_name = pd.read_excel(Matrix, sheet_name)
    print(sheet_name.info())
Using only the code snippet you have shown, each sheet (each DataFrame) will be assigned to the variable sheet_name. Thus, this variable is overwritten on each iteration and you will only have the last sheet as a DataFrame assigned to that variable.
To achieve what you want to do you have to store each sheet, loaded as a DataFrame, somewhere, a list for example. You can then merge or concatenate them, depending on your needs.
Try this:
all_my_sheets = []
for sheet_name in Matrix.sheet_names:
    df = pd.read_excel(Matrix, sheet_name)
    all_my_sheets.append(df)
Or, even better, use a list comprehension:
all_my_sheets = [pd.read_excel(Matrix, sheet_name) for sheet_name in Matrix.sheet_names]
You can then concatenate them into one DataFrame like this:
final_df = pd.concat(all_my_sheets, sort=False)
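Since the first column is the same in all your sheets and you mentioned pd.merge(), you could also merge instead of concatenating; a minimal sketch, assuming the shared column is named 'id' (substitute the real name):
from functools import reduce

# 'id' is a placeholder for the name of the shared first column
final_df = reduce(lambda left, right: pd.merge(left, right, on='id'), all_my_sheets)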
You might consider using the openpyxl package:
from openpyxl import load_workbook
import pandas as pd

wb = load_workbook(filename=file_path, read_only=True)
all_my_sheets = wb.sheetnames

# Assuming your sheets have the same headers and footers
records = []
n = 1
for sheet_name in all_my_sheets:
    ws = wb[sheet_name]
    for row in ws.iter_rows(min_col=1,
                            min_row=n,
                            max_col=ws.max_column,
                            max_row=ws.max_row,
                            values_only=True):
        records.append(list(row))
    # Make sure you don't duplicate the header on later sheets
    n = 2

# Set the column names from the first sheet's header row
header = records.pop(0)

# Create your df
df = pd.DataFrame(records, columns=header)
It may be easiest to call read_excel() once, and save the contents into a list.
So, the first step would look like this:
dfs = pd.read_excel(Matrix, sheet_name=["Sheet 1", "Sheet 2", "Sheet 3"])
Note that the sheet names you use in the list should be the same as those in the excel file, and that this returns a dictionary of DataFrames keyed by sheet name. Then, if you wanted to vertically concatenate these sheets, you would just call:
final_df = pd.concat(dfs.values(), axis=0)
Note that this solution would result in a final_df that includes column headers from all three sheets. So, ideally they would be the same. It sounds like you want to merge the information, which would be done differently; we can't help you with the merge without more information.
I hope this helps!
Thanks to StackOverflow (so basically all of you) I've managed to solve almost all my issues regarding reading Excel data into a DataFrame, except one... My code goes like this:
df = pd.read_excel(
    fileName,
    sheetname=sheetName,
    header=None,
    skiprows=3,
    index_col=None,
    skip_footer=0,
    parse_cols='A:J,AB:CC,CE:DJ',
    na_values='')
The thing is that in the Excel files I'm parsing, the last row of data I want to load is in a different position every time. The only way I can identify it is to look for the word "SUMA" in the first column of each sheet; the last row I want to load will be the row just before the one containing "SUMA". The rows below "SUMA" contain information that is irrelevant to me, and there can be quite a lot of them, so I want to avoid loading them.
If you do it with generators, you could do something like this. This loads the complete DataFrame, but afterwards filters out the rows from 'SUMA' onwards, using the trick that True == 1 (so the cumulative sum stays 0 until the first 'SUMA' row), and you only keep the relevant rows. You might need some work afterwards to get the dtypes correct.
def read_files(files):
    sheetname = 'my_sheet'
    for file in files:
        yield pd.read_excel(
            file,
            sheetname=sheetname,
            header=None,
            skiprows=3,
            index_col=None,
            skip_footer=0,
            parse_cols='A:J,AB:CC,CE:DJ',
            na_values='')

def clean_files(dataframes):
    summary_text = 'SUMA'
    for df in dataframes:
        # rows from the first 'SUMA' onwards get a cumulative count > 0
        after_suma = df.iloc[:, 0].astype(str).str.startswith(summary_text).cumsum()
        # keep only the rows before 'SUMA'
        yield df.loc[after_suma == 0, :]
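To put the two generators together and end up with one DataFrame (the file names below are placeholders):
files = ['report1.xlsx', 'report2.xlsx']  # placeholder file names
final_df = pd.concat(clean_files(read_files(files)), ignore_index=True)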