Python/Pandas: Iterate over Excel files and extract information


I found threads on extracting info from various sheets of the same file, and solutions to similar problems, but nothing exactly like mine.
I have several Excel workbooks, each containing several sheets. I would like to iterate over each workbook and extract information from a sheet named "3. Prices". This sheet is present in every file. Two pieces of information need to be extracted from this sheet in every file: the first is always in the cell range E13:H13, and the second is in cells F19, I19 and K19.
I would like to place the two pieces of extracted information next to one another (for a given file) and then stack the extracts from all files on top of one another in one master file. Also, the first column of the combined file should be the file name.
So something like this:
What I've tried so far, with no luck
from openpyxl import load_workbook
import os
import pandas as pd
directory = os.listdir('C:\\User\\files')
for file in directory:
    if os.path.isfile(file):
        file_name = file[0:3]
        workbook = load_workbook(filename = file)
        sheet = workbook['3. Prices']
        e13 = sheet['E13'].value
        f13 = sheet['F13'].value
        g13 = sheet['G13'].value
        h13 = sheet['H13'].value
        f19 = sheet['F19'].value
        i19 = sheet['I19'].value
        k19 = sheet['K19'].value
        df = df.append(pd.DataFrame({
            "File_name":file_name,
            "E13":e13, "F13":f13, "G13":g13,"H13":h13,
            "F19":f19,"I19":i19,"K19":i19,
        }, index=[0]))

I figured it out. I was missing two elements: 1) changing the current working directory to match the one in the variable 'directory', and 2) defining a dataframe at the start.
from openpyxl import load_workbook
import os
import pandas as pd
os.chdir('C:\\User\\files')
directory = os.listdir('C:\\User\\files')
df = pd.DataFrame()
for file in directory:
    if os.path.isfile(file):
        file_name = file[0:3]
        # data_only=True returns the cached formula results instead of the formulas
        workbook = load_workbook(filename=file, data_only=True)
        sheet = workbook['3. Prices']
        e13 = sheet['E13'].value
        f13 = sheet['F13'].value
        g13 = sheet['G13'].value
        h13 = sheet['H13'].value
        f19 = sheet['F19'].value
        i19 = sheet['I19'].value
        k19 = sheet['K19'].value
        # DataFrame.append was removed in pandas 2.0, so use pd.concat instead
        df = pd.concat([df, pd.DataFrame({
            "File_name": file_name,
            "E13": e13, "F13": f13, "G13": g13, "H13": h13,
            "F19": f19, "I19": i19, "K19": k19,
        }, index=[0])], ignore_index=True)
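The same extraction can be sketched without changing the working directory, by building full paths with pathlib and collecting one dict per file. This is only a sketch under assumptions: the function name extract_prices and the constant CELLS are made up for illustration, only *.xlsx files are considered, and the file name column holds the file stem rather than the first three characters.

```python
from pathlib import Path

import pandas as pd
from openpyxl import load_workbook

# Cells to pull from every workbook (E13:H13 plus F19, I19, K19)
CELLS = ["E13", "F13", "G13", "H13", "F19", "I19", "K19"]


def extract_prices(folder, sheet_name="3. Prices", cells=CELLS):
    """Return one row per .xlsx file in folder: the file name, then the requested cells."""
    rows = []
    for path in sorted(Path(folder).glob("*.xlsx")):
        # data_only=True reads cached formula results rather than formula text
        workbook = load_workbook(str(path), data_only=True)
        sheet = workbook[sheet_name]
        row = {"File_name": path.stem}
        row.update({ref: sheet[ref].value for ref in cells})
        rows.append(row)
        workbook.close()
    # Build the frame once at the end instead of growing it inside the loop
    return pd.DataFrame(rows, columns=["File_name", *cells])
```

Calling extract_prices('C:\\User\\files').to_excel('master.xlsx', index=False) would then write the stacked result; the output name master.xlsx is just a placeholder.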

Related

pandas creates 2 copies of files in a loop

I have a dataframe like the one below
import numpy as np
import pandas as pd
from numpy.random import default_rng
rng = default_rng(100)
cdf = pd.DataFrame({'Id':[1,2,3,4,5],
                    'customer': rng.choice(list('ACD'),size=(5)),
                    'region': rng.choice(list('PQRS'),size=(5)),
                    'dumeel': rng.choice(list('QWER'),size=(5)),
                    'dumma': rng.choice((1234),size=(5)),
                    'target': rng.choice([0,1],size=(5))
                    })
I am trying to split the dataframe based on Customer and store it in a folder. Not necessary to understand the full code. The issue is in the last line.
i = 0
for k, v in df.groupby(['Customer']):
    print(k.split('#')[0])
    LM = k.split('#')[0]
    i = i+1
    unique_cust_names = '_'.join(v['Customer'].astype(str).unique())
    unique_ids = '_'.join(v['Id'].astype(str).unique())
    unique_location = '_'.join(v['dumeel'].astype(str).unique())
    filename = '_'.join([unique_ids, unique_cust_names, unique_location, LM])
    print(filename)
    with pd.ExcelWriter(f"{filename}.xlsx", engine='xlsxwriter') as writer:
        v.to_excel(writer,columns=col_list,index=False)
    wb = load_workbook(filename = 'format_sheet.xlsx')
    sheet_from =wb.worksheets[0]
    wb1 = load_workbook(filename = f"{filename}.xlsx")
    sheet_to = wb1.worksheets[0]
    copy_styles(sheet_from, sheet_to)
    #wb.close()
    tab = Table(displayName = "Table1", ref = "A1:" + get_column_letter(wb1.worksheets[0].max_column) + str(wb1.worksheets[0].max_row) )
    style = TableStyleInfo(name="TableStyleMedium2", showFirstColumn=False, showLastColumn=False, showRowStripes=True, showColumnStripes=False)
    tab.tableStyleInfo = style
    wb1.worksheets[0].add_table(tab)
    #wb1.worksheets[0].parent.save(f"{filename}.xlsx")
    wb1.save("test_files/" + f"{filename}.xlsx") # issue is here
    wb1.close()
print("Total number of customers to be emailed is ", i)
Though the code works fine, I guess the issue is in the line below:
wb1.save("test_files/" + f"{filename}.xlsx") # issue is here
This creates two copies of each file: one in the current folder (where the Jupyter notebook is) and another inside the test_files folder.
For example, I see two files called test1.xlsx: one in the current folder and one inside the test_files folder (path test_files/test1.xlsx).
How can I avoid this?
I expect only one file per customer to be saved, inside the test_files folder.

The issue happens because you reference two different file names, one with the prefix "test_files/" and one without it. pd.ExcelWriter writes the first copy to the current folder, and wb1.save then writes a second copy into test_files. The best way to handle this is to define the file name once, as follows
dir_filename = "test_files/" + f"{filename}.xlsx"
and then reference it in the following places
with pd.ExcelWriter(dir_filename, engine='xlsxwriter') as writer:
    v.to_excel(writer,columns=col_list,index=False)
##
wb1 = load_workbook(filename = dir_filename)
##
wb1.save(dir_filename)
Hope it helps
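A small helper can make the "one path, used everywhere" idea harder to get wrong. The sketch below is an illustration only: the name output_path is made up, and creating the folder when it is missing is an added convenience that the original code does not have.

```python
from pathlib import Path


def output_path(filename: str, out_dir: str = "test_files") -> Path:
    """Return out_dir/<filename>.xlsx, creating out_dir if it does not exist yet."""
    folder = Path(out_dir)
    folder.mkdir(parents=True, exist_ok=True)
    return folder / f"{filename}.xlsx"
```

Inside the loop, compute target = str(output_path(filename)) once and pass target to pd.ExcelWriter, load_workbook and wb1.save, so that every step touches the same file.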

How to set a looped pd.read_excel to skip if an error returns

I have some code set up to read specific data from every single xlsx file in a folder. However, the workbooks use two different naming conventions for the sheet I want (for example: Titlepage and Title Page). My current code is this:
import pandas as pd
import string
import glob
import os
directory = 'file path'
files = os.listdir(directory)
list_of_dfs = []
# Start of loop
os.chdir('file path')
for file in files:
    df = pd.read_excel(file)
    df = df.T
    df = df.iloc[[1],:24]
    list_of_dfs.append(df)
# End of loop
data_combined = pd.concat(list_of_dfs)
data_combined.to_excel('file path/output.xlsx', index=False)
I am thinking of running the loop above twice, once for Title Page and once for Titlepage. However, the code will raise an error if it cannot find the specified sheet name. Is there any way to tell Python to move on to the next xlsx file if no such sheet name is found?
Edit:
I used the code below to make this work.
os.chdir("file path")
for file in files:
    try:
        df = pd.read_excel(file, sheet_name='Titlepage')
        df = df.T
        df = df.iloc[[1],:24]
        list_of_dfs.append(df)
    except:
        pass
for file in files:
    try:
        df = pd.read_excel(file, sheet_name='Title Page')
        df = df.T
        df = df.iloc[[1],:24]
        list_of_dfs.append(df)
    except:
        pass
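An alternative to looping twice with a bare except is to look at the sheet names first and read whichever candidate exists. This is a sketch, not the poster's method: the helper names first_matching_sheet and read_title_page are made up, and the candidate names are taken from the question.

```python
import pandas as pd


def first_matching_sheet(available, candidates):
    """Return the first name in candidates that appears in available, else None."""
    for name in candidates:
        if name in available:
            return name
    return None


def read_title_page(path, candidates=("Titlepage", "Title Page")):
    """Read whichever candidate sheet the workbook has; return None if it has neither."""
    with pd.ExcelFile(path) as workbook:
        sheet = first_matching_sheet(workbook.sheet_names, candidates)
        if sheet is None:
            return None  # the caller can skip this file
        return workbook.parse(sheet)
```

In the loop, a file whose read_title_page result is None is simply skipped with continue, and other errors (a corrupt file, for instance) are no longer silently swallowed.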

How to save transformed file into new excel file using Openpyxl Python?

I have 3 Excel files in my working directory. All 3 file names end with "_Updated.xlsx". I want to transform the files so that all empty rows in each file are deleted. I have created functions for this, but I cannot save the transformed files using the code below. I am not sure what is wrong. I am creating new files because I want to keep my raw files.
Python code
import openpyxl
import os
from openpyxl import load_workbook,Workbook
import glob
from pathlib import Path
Excel_file_path="/Excel"
for file in Path(Excel_file_path).glob('*_Updated.xlsx'):
    wb=load_workbook(file)
    wb_modified = False
    for sheet in wb.worksheets:
        max_row_in_sheet = sheet.max_row
        max_col_in_sheet = sheet.max_column
        sheet_modified = False
        if max_row_in_sheet > 1:
            first_nonempty_row = nonempty_row() # Function to find nonempty row
            sheet_modified = del_rows_before(first_nonempty_row) #Function to delete nonempty row
        wb_modified = wb_modified or sheet_modified
    if wb_modified:
        for workbook in workbooks:
            for sheet in wb.worksheets:
                new_wb = Workbook()
                ws = new_wb.active
                for row_data in sheet.iter_rows():
                    for row_cell in row_data:
                        ws[row_cell.coordinate].value = row_cell.value
                new_wb.save("/Excel/"+sheet.title+"_Transformed.xlsx")

In case anyone is still looking for an answer to my question above, below is the code that worked for me.
import openpyxl
import os
from openpyxl import load_workbook
import glob
from pathlib import Path
Excel_file_path="/Excel"
for file in Path(Excel_file_path).glob('*_Updated.xlsx'):
    wb=load_workbook(file)
    wb_modified = False
    for sheet in wb.worksheets:
        max_row_in_sheet = sheet.max_row
        max_col_in_sheet = sheet.max_column
        sheet_modified = False
        if max_row_in_sheet > 1:
            first_nonempty_row = get_first_nonempty_row() # Function to find the first nonempty row
            sheet_modified = del_rows_before(first_nonempty_row) # Function to delete the empty rows before it
    file_name = os.path.basename(file)
    # Save next to the source files under a new name, so the raw file is kept
    wb.save(os.path.join(Excel_file_path, file_name[:-5]+"_Transformed.xlsx"))
    wb.close()
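The two helper functions are not shown in the post. A minimal sketch of what they might look like follows; these bodies are assumptions, and unlike the calls above they take the sheet as an explicit argument.

```python
from openpyxl import Workbook


def get_first_nonempty_row(sheet):
    """Return the 1-based index of the first row holding any value, or None."""
    for index, row in enumerate(sheet.iter_rows(values_only=True), start=1):
        if any(value is not None for value in row):
            return index
    return None


def del_rows_before(sheet, first_nonempty_row):
    """Delete the rows above first_nonempty_row; return True if anything was removed."""
    if first_nonempty_row is None or first_nonempty_row <= 1:
        return False
    sheet.delete_rows(1, first_nonempty_row - 1)
    return True


# Small in-memory demonstration: two empty rows above the data
wb = Workbook()
ws = wb.active
ws["A3"] = "header"
ws["A4"] = "value"
first = get_first_nonempty_row(ws)
changed = del_rows_before(ws, first)
```

Note that this only removes leading empty rows, which matches the helper names; empty rows in the middle of the data would need a separate pass.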

Converting all worksheets in an Excel workbook to csv format

My Excel document my.xlsx has two Sheets named Sheet1 and Sheet2. I want to convert all worksheets to csv format using xlsx2csv. I used the following commands:
from xlsx2csv import *
xlsx2csv my.xlsx convert.csv
File "<stdin>", line 1
xlsx2csv my.xlsx convert.csv
^
SyntaxError: invalid syntax
x2c -a my.xlsx my1.csv
File "<stdin>", line 1
x2c -a my.xlsx my1.csv
^
SyntaxError: invalid syntax
Any help, please.

I have not used xlsx2csv before, but why not try pandas?
Your requirement can be met like this:
import pandas as pd
for sheet in ['Sheet1', 'Sheet2']:
    # the keyword is sheet_name; the old spelling sheetname is no longer accepted by current pandas
    df = pd.read_excel('my.xlsx', sheet_name=sheet)
    df.to_csv(sheet + '_output.csv', index=False)

You can do something like the following:
import pandas as pd
xls_file = pd.ExcelFile('<path_to_your_excel_file>')
sheet_names = xls_file.sheet_names
for sheet in sheet_names:
    df = xls_file.parse(sheet)
    # write each sheet to its own csv file
    df.to_csv(sheet + '.csv', index=False)
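pandas can also return every sheet at once: read_excel with sheet_name=None gives a dict mapping sheet names to DataFrames. The sketch below writes such a dict to one csv per sheet; the function name write_sheets_to_csv and the output folder are assumptions for illustration.

```python
from pathlib import Path

import pandas as pd


def write_sheets_to_csv(sheets, out_dir):
    """Write each DataFrame in a {sheet name: DataFrame} mapping to out_dir/<sheet name>.csv."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for name, frame in sheets.items():
        target = out / f"{name}.csv"
        frame.to_csv(target, index=False)
        written.append(target)
    return written


# Typical call (assumes my.xlsx exists in the working directory):
# write_sheets_to_csv(pd.read_excel("my.xlsx", sheet_name=None), "csv_out")
```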

Xlsx2csv Python implementation:
I could only run Xlsx2csv with the sheetid parameter. To get the sheet names and ids, get_sheet_details is used.
csvfrmxlsx creates a csv file for each sheet in the csv folder under the workbook's parent directory.
import pandas as pd
from pathlib import Path
def get_sheet_details(filename):
    import os
    import xmltodict
    import shutil
    import zipfile
    sheets = []
    # Make a temporary directory with the file name
    directory_to_extract_to = (filename.with_suffix(''))
    os.mkdir(directory_to_extract_to)
    # Extract the xlsx file as it is just a zip file
    zip_ref = zipfile.ZipFile(filename, 'r')
    zip_ref.extractall(directory_to_extract_to)
    zip_ref.close()
    # Open the workbook.xml which is very light and only has meta data, get sheets from it
    path_to_workbook = directory_to_extract_to / 'xl' / 'workbook.xml'
    with open(path_to_workbook, 'r') as f:
        xml = f.read()
        dictionary = xmltodict.parse(xml)
        for sheet in dictionary['workbook']['sheets']['sheet']:
            # xmltodict prefixes XML attributes with '@'
            sheet_details = {
                'id': sheet['@sheetId'], # can be sheetId for some versions
                'name': sheet['@name'] # can be name
            }
            sheets.append(sheet_details)
    # Delete the extracted files directory
    shutil.rmtree(directory_to_extract_to)
    return sheets
def csvfrmxlsx(xlsxfl, df): # create csv files in csv folder on parent directory
    from xlsx2csv import Xlsx2csv
    for index, row in df.iterrows():
        shnum = row['id']
        shnph = xlsxfl.parent / 'csv' / Path(row['name'] + '.csv') # path for converted csv file
        Xlsx2csv(str(xlsxfl), outputencoding="utf-8").convert(str(shnph), sheetid=int(shnum))
    return
pthfnc = 'c:/xlsx/'
wrkfl = 'my.xlsx'
xls_file = Path(pthfnc + wrkfl)
sheetsdic = get_sheet_details(xls_file) # dictionary with sheet names and ids without opening xlsx file
df = pd.DataFrame.from_dict(sheetsdic)
csvfrmxlsx(xls_file, df) # df with sheets to be converted
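The sheet names and ids can also be read straight from the zip archive, without extracting it to a temporary directory and without xmltodict. A sketch using only the standard library follows; the function name list_sheets is made up, and it assumes the usual (transitional) SpreadsheetML namespace, so a workbook saved in strict Open XML mode would need a different namespace string.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by xl/workbook.xml in ordinary .xlsx files (assumption: not strict mode)
MAIN_NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def list_sheets(xlsx):
    """Return [{'id': ..., 'name': ...}] read from xl/workbook.xml inside the archive."""
    with zipfile.ZipFile(xlsx) as archive:
        root = ET.fromstring(archive.read("xl/workbook.xml"))
    return [
        {"id": sheet.get("sheetId"), "name": sheet.get("name")}
        for sheet in root.iter(f"{MAIN_NS}sheet")
    ]
```

The returned list has the same shape as the output of get_sheet_details, so it could be passed to pd.DataFrame.from_dict in the same way.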

From password-protected Excel file to pandas DataFrame

I can open a password-protected Excel file with this:
import sys
import win32com.client
xlApp = win32com.client.Dispatch("Excel.Application")
print("Excel library version:", xlApp.Version)
filename, password = sys.argv[1:3]
xlwb = xlApp.Workbooks.Open(filename, Password=password)
# xlwb = xlApp.Workbooks.Open(filename)
xlws = xlwb.Sheets(1) # counts from 1, not from 0
print(xlws.Name)
print(xlws.Cells(1, 1)) # that's A1
I'm not sure, though, how to transfer the information to a pandas dataframe. Do I need to read the cells one by one, or is there a more convenient method?

Simple solution
Install msoffcrypto-tool first:
pip install --user msoffcrypto-tool
Then decrypt the workbook into memory and read it:
import io
import pandas as pd
import msoffcrypto
passwd = 'xyz'
decrypted_workbook = io.BytesIO()
# i is the path to the password-protected workbook
with open(i, 'rb') as file:
    office_file = msoffcrypto.OfficeFile(file)
    office_file.load_key(password=passwd)
    office_file.decrypt(decrypted_workbook)
df = pd.read_excel(decrypted_workbook, sheet_name='abc')
Exporting all sheets of each Excel file in directories and sub-directories to separate csv files
import os
from glob import glob
PATH = "Active Cons data"
# Scan all the Excel files in the directories and sub-directories
excel_files = [y for x in os.walk(PATH) for y in glob(os.path.join(x[0], '*.xlsx'))]
for i in excel_files:
    print(str(i))
    decrypted_workbook = io.BytesIO()
    with open(i, 'rb') as file:
        office_file = msoffcrypto.OfficeFile(file)
        office_file.load_key(password=passwd)
        office_file.decrypt(decrypted_workbook)
    # sheet_name=None returns a dict of {sheet name: DataFrame}
    sheets = pd.read_excel(decrypted_workbook, sheet_name=None)
    print(list(sheets)) # list of sheet names
    for sheet, df in sheets.items():
        new_file = f"D:\\all_csv\\{sheet}.csv"
        df.to_csv(new_file, index=False)

Assuming the starting cell is given as (StartRow, StartCol) and the ending cell is given as (EndRow, EndCol), I found the following worked for me:
# Get the content in the rectangular selection region
# content is a tuple of tuples
content = xlws.Range(xlws.Cells(StartRow, StartCol), xlws.Cells(EndRow, EndCol)).Value
# Transfer content to pandas dataframe
dataframe = pandas.DataFrame(list(content))
Note: Excel Cell B5 is given as row 5, col 2 in win32com. Also, we need list(...) to convert from tuple of tuples to list of tuples, since there is no pandas.DataFrame constructor for a tuple of tuples.
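If the first row of the selected range holds the column names, it can be split off while building the dataframe. The sketch below uses a hard-coded tuple of tuples as a stand-in for the value returned by xlws.Range(...).Value, so it runs without Excel; the function name range_to_frame and the sample data are made up.

```python
import pandas as pd


def range_to_frame(content, header=True):
    """Build a DataFrame from a tuple of row tuples, such as the value of Range(...).Value."""
    rows = [list(row) for row in content]
    if header and rows:
        # First row becomes the column names, the rest become the data
        return pd.DataFrame(rows[1:], columns=rows[0])
    return pd.DataFrame(rows)


# Stand-in for xlws.Range(...).Value (sample data, not from a real workbook)
content = (("name", "price"), ("apple", 1.5), ("pear", 2.0))
dataframe = range_to_frame(content)
```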

From David Hamann's site (all credits go to him):
https://davidhamann.de/2018/02/21/read-password-protected-excel-files-into-pandas-dataframe/
Use xlwings. Opening the file will first launch the Excel application so you can enter the password.
import pandas as pd
import xlwings as xw
PATH = '/Users/me/Desktop/xlwings_sample.xlsx'
wb = xw.Book(PATH)
sheet = wb.sheets['sample']
df = sheet['A1:C4'].options(pd.DataFrame, index=False, header=True).value
df

Assuming that you can save the encrypted file back to disk using the win32com API (which I realize might defeat the purpose) you could then immediately call the top-level pandas function read_excel. You'll need to install some combination of xlrd (for Excel 2003), xlwt (also for 2003), and openpyxl (for Excel 2007) first though. Here is the documentation for reading in Excel files. Currently pandas does not provide support for using the win32com API to read Excel files. You're welcome to open up a GitHub issue if you'd like.

Based on the suggestion provided by @ikeoddy, this should put the pieces together:
How to open a password protected excel file using python?
# Import modules
import pandas as pd
import win32com.client
import os
import getpass
# Name file variables
file_path = r'your_file_path'
file_name = r'your_file_name.extension'
full_name = os.path.join(file_path, file_name)
# print(full_name)
Getting command-line password input in Python
# You are prompted to provide the password to open the file
xl_app = win32com.client.Dispatch('Excel.Application')
pwd = getpass.getpass('Enter file password: ')
Workbooks.Open Method (Excel)
xl_wb = xl_app.Workbooks.Open(full_name, False, True, None, pwd)
xl_app.Visible = False
xl_sh = xl_wb.Worksheets('your_sheet_name')
# Get last_row
row_num = 0
cell_val = ''
while cell_val is not None:
    row_num += 1
    cell_val = xl_sh.Cells(row_num, 1).Value
    # print(row_num, '|', cell_val, type(cell_val))
last_row = row_num - 1
# print(last_row)
# Get last_column
col_num = 0
cell_val = ''
while cell_val is not None:
    col_num += 1
    cell_val = xl_sh.Cells(1, col_num).Value
    # print(col_num, '|', cell_val, type(cell_val))
last_col = col_num - 1
# print(last_col)
ikeoddy's answer:
content = xl_sh.Range(xl_sh.Cells(1, 1), xl_sh.Cells(last_row, last_col)).Value
# list(content)
df = pd.DataFrame(list(content[1:]), columns=content[0])
df.head()
python win32 COM closing excel workbook
xl_wb.Close(False)

Adding to @Maurice's answer, to get all the cells in the sheet without having to specify the range:
wb = xw.Book(PATH, password='somestring')
sheet = wb.sheets[0] #get first sheet
#sheet.used_range.address returns string of used range
df = sheet[sheet.used_range.address].options(pd.DataFrame, index=False, header=True).value
