From password-protected Excel file to pandas DataFrame

I can open a password-protected Excel file with this:
import sys
import win32com.client
xlApp = win32com.client.Dispatch("Excel.Application")
print("Excel library version:", xlApp.Version)
filename, password = sys.argv[1:3]
xlwb = xlApp.Workbooks.Open(filename, Password=password)
# xlwb = xlApp.Workbooks.Open(filename)
xlws = xlwb.Sheets(1) # counts from 1, not from 0
print(xlws.Name)
print(xlws.Cells(1, 1)) # that's A1
I'm not sure, though, how to transfer the information to a pandas DataFrame. Do I need to read the cells one by one, or is there a more convenient way to do this?

Simple solution, using msoffcrypto-tool (install it with pip install --user msoffcrypto-tool):
import io
import pandas as pd
import msoffcrypto

passwd = 'xyz'
file_path = 'protected.xlsx'  # placeholder: path to the encrypted workbook

# Decrypt the workbook into an in-memory buffer
decrypted_workbook = io.BytesIO()
with open(file_path, 'rb') as file:
    office_file = msoffcrypto.OfficeFile(file)
    office_file.load_key(password=passwd)
    office_file.decrypt(decrypted_workbook)

df = pd.read_excel(decrypted_workbook, sheet_name='abc')
Exporting all sheets of each Excel file from directories and sub-directories to separate CSV files:
import os
from glob import glob

PATH = "Active Cons data"

# Scan for all Excel files in PATH and its sub-directories
excel_files = [y for x in os.walk(PATH) for y in glob(os.path.join(x[0], '*.xlsx'))]

for excel_file in excel_files:
    print(excel_file)
    decrypted_workbook = io.BytesIO()
    with open(excel_file, 'rb') as file:
        office_file = msoffcrypto.OfficeFile(file)
        office_file.load_key(password=passwd)
        office_file.decrypt(decrypted_workbook)

    # sheet_name=None returns a dict mapping sheet names to DataFrames
    sheets = pd.read_excel(decrypted_workbook, sheet_name=None)
    print(list(sheets))  # list of sheet names
    for sheet, df in sheets.items():
        # Sheets with the same name in different workbooks overwrite each other here
        new_file = f"D:\\all_csv\\{sheet}.csv"
        df.to_csv(new_file, index=False)
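The export loop above names each CSV after its sheet only, so two workbooks that both contain a sheet called, say, Sheet1 would write to the same file. A minimal sketch of one way to avoid that, assuming it is acceptable to prefix the CSV name with the workbook's file name. The helper csv_path_for, its out_dir parameter, and the set of characters it keeps are illustrative assumptions, not part of the original answer:

```python
from pathlib import Path

def csv_path_for(excel_file, sheet_name, out_dir):
    """Build an output path of the form <out_dir>/<workbook stem>__<sheet>.csv."""
    stem = Path(excel_file).stem
    # Keep letters, digits and a few safe punctuation characters; replace the rest
    safe_sheet = "".join(
        c if c.isalnum() or c in " ._-" else "_" for c in sheet_name
    )
    return Path(out_dir) / f"{stem}__{safe_sheet}.csv"

# Two workbooks with a sheet of the same name now map to different files
print(csv_path_for("Active Cons data/2021/jan.xlsx", "Sheet1", "all_csv").name)
print(csv_path_for("Active Cons data/2021/feb.xlsx", "Sheet1", "all_csv").name)
```

The returned path could then be passed to df.to_csv in place of the hard-coded D:\all_csv name.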

Assuming the starting cell is given as (StartRow, StartCol) and the ending cell is given as (EndRow, EndCol), I found the following worked for me:
import pandas

# Get the content in the rectangular selection region
# content is a tuple of tuples (one inner tuple per row)
content = xlws.Range(xlws.Cells(StartRow, StartCol), xlws.Cells(EndRow, EndCol)).Value
# Transfer content to pandas dataframe
dataframe = pandas.DataFrame(list(content))
Note: Excel cell B5 is given as row 5, col 2 in win32com. Also, list(...) converts the tuple of tuples into a list of tuples, a form the pandas.DataFrame constructor accepts.

from David Hamann's site (all credits go to him)
https://davidhamann.de/2018/02/21/read-password-protected-excel-files-into-pandas-dataframe/
Use xlwings. Opening the file will first launch the Excel application so you can enter the password.
import pandas as pd
import xlwings as xw
PATH = '/Users/me/Desktop/xlwings_sample.xlsx'
wb = xw.Book(PATH)
sheet = wb.sheets['sample']
df = sheet['A1:C4'].options(pd.DataFrame, index=False, header=True).value
df

Assuming that you can save an unprotected copy of the encrypted file back to disk using the win32com API (which I realize might defeat the purpose), you could then immediately call the top-level pandas function read_excel. You'll need to install some combination of xlrd (for the Excel 2003 .xls format), xlwt (also for .xls), and openpyxl (for the Excel 2007 .xlsx format) first, though. See the pandas documentation on reading in Excel files. Currently pandas does not provide support for using the win32com API to read Excel files. You're welcome to open up a GitHub issue if you'd like.

Based on the suggestion provided by @ikeoddy, this should put the pieces together:
# Import modules
import pandas as pd
import win32com.client
import os
import getpass

# Name file variables
file_path = r'your_file_path'
file_name = r'your_file_name.extension'
full_name = os.path.join(file_path, file_name)
# print(full_name)

# You are prompted to provide the password to open the file
# (see "Getting command-line password input in Python")
xl_app = win32com.client.Dispatch('Excel.Application')
pwd = getpass.getpass('Enter file password: ')

# Open the workbook read-only with the password
# (see "How to open a password protected excel file using python?" and "Workbooks.Open Method (Excel)")
xl_wb = xl_app.Workbooks.Open(full_name, False, True, None, pwd)
xl_app.Visible = False
xl_sh = xl_wb.Worksheets('your_sheet_name')

# Get last_row: scan column 1 until the first empty cell
row_num = 0
cell_val = ''
while cell_val is not None:
    row_num += 1
    cell_val = xl_sh.Cells(row_num, 1).Value
    # print(row_num, '|', cell_val, type(cell_val))
last_row = row_num - 1
# print(last_row)

# Get last_column: scan row 1 until the first empty cell
col_num = 0
cell_val = ''
while cell_val is not None:
    col_num += 1
    cell_val = xl_sh.Cells(1, col_num).Value
    # print(col_num, '|', cell_val, type(cell_val))
last_col = col_num - 1
# print(last_col)

# Read the whole range at once (ikeoddy's answer); the first row holds the column names
content = xl_sh.Range(xl_sh.Cells(1, 1), xl_sh.Cells(last_row, last_col)).Value
# list(content)
df = pd.DataFrame(list(content[1:]), columns=content[0])
df.head()

# Close the workbook without saving (see "python win32 COM closing excel workbook")
xl_wb.Close(False)
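The last-row and last-column search in this answer walks down column 1 and across row 1 until it meets the first empty cell. A minimal sketch of that same scan against a plain in-memory grid, so it can be run without Excel. find_extent, get_cell, and grid are hypothetical stand-ins for the worksheet and its Cells(row, col).Value lookup:

```python
def find_extent(get_cell):
    """Return (last_row, last_col) by scanning column 1 and row 1 until an empty cell.

    get_cell(row, col) is 1-based and returns None for an empty cell.
    """
    last_row = 0
    while get_cell(last_row + 1, 1) is not None:
        last_row += 1
    last_col = 0
    while get_cell(1, last_col + 1) is not None:
        last_col += 1
    return last_row, last_col

# A small stand-in for a worksheet: a header row plus two data rows
grid = [
    ["id", "name", "score"],
    [1, "a", 0.5],
    [2, "b", 0.7],
]

def get_cell(row, col):
    # Cells outside the grid count as empty, like unused worksheet cells
    if row <= len(grid) and col <= len(grid[0]):
        return grid[row - 1][col - 1]
    return None

print(find_extent(get_cell))  # (3, 3)
```

One consequence of this design is that a blank cell in the middle of column A or row 1 ends the scan early, so it suits sheets whose first column and header row are fully populated.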

Adding to @Maurice's answer, to get all the cells in the sheet without having to specify the range:
wb = xw.Book(PATH, password='somestring')
sheet = wb.sheets[0]  # get first sheet
# sheet.used_range.address returns the address of the used range as a string
df = sheet[sheet.used_range.address].options(pd.DataFrame, index=False, header=True).value

Related

How to import multiple csv files based on a list, and append them into multiple worksheets

I have multiple CSV files to import into multiple worksheets, each named the same as its CSV file.
However, I have difficulty creating/appending multiple worksheets.
If I use ExcelWriter(pathDestination, mode = 'a'), a FileNotFoundError is raised.
If I use ExcelWriter(pathDestination), then only the last CSV file ends up in the workbook.
How can I improve the code without having to list each csvpath when calling to_excel?
import openpyxl
import numpy as np
import pandas as pd
import os
pathDestination = 'Downloads/TemplateOne.xlsx'
csvpathI = '2019_27101220_Export.csv'
csvpathII = '2019_27101220_Import.csv'
csvpathIII = '2020_27101220_Export.csv'
csvpathIV = '2020_27101220_Import.csv'
csvpath_list = [csvpathI, csvpathII, csvpathIII, csvpathIV]
for csvpath in csvpath_list:
    df = pd.read_csv(csvpath)
    conversion_unit = 1000
    supplymentry_conversion_unit = 1000
    df['quantity_converted'] = np.multiply(df['Quantity'],conversion_unit)
    df['supplimentry_quantity_converted'] = np.multiply(df['Supplimentary Quantity'],conversion_unit)
    csvnames = os.path.basename(csvpath).split(".")[0]
    with pd.ExcelWriter(pathDestination) as writer:
        df.to_excel(writer, sheet_name = csvnames, index = False)

You need to put the loop inside the context manager in order to save each DataFrame (df) to a separate sheet. Opening a new ExcelWriter on every iteration recreates the file, which is why only the last sheet survives.
Try this:
conversion_unit = 1000
supplymentry_conversion_unit = 1000
with pd.ExcelWriter(pathDestination) as writer:
    for csvpath in csvpath_list:
        df = pd.read_csv(csvpath)
        df['quantity_converted'] = df['Quantity'].mul(conversion_unit)
        df['supplimentry_quantity_converted'] = df['Supplimentary Quantity'].mul(supplymentry_conversion_unit)
        df.to_excel(writer, sheet_name = csvpath.split(".")[0], index = False)
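The answer derives each sheet name with csvpath.split(".")[0], which keeps any directory part of the path and cuts at the first dot, even if that dot sits in a folder name. A small sketch of a more defensive alternative, assuming the sheet should be named after the file's base name. sheet_name_for is a hypothetical helper; the 31-character cap reflects Excel's limit on worksheet name length:

```python
import os

MAX_SHEET_NAME = 31  # Excel's limit on worksheet name length

def sheet_name_for(csvpath):
    """Derive a worksheet name from a CSV path: base name without extension, truncated."""
    base = os.path.splitext(os.path.basename(csvpath))[0]
    return base[:MAX_SHEET_NAME]

for p in ["2019_27101220_Export.csv", "data.v2/2020_27101220_Import.csv"]:
    print(sheet_name_for(p))
```

The result could be passed as sheet_name to df.to_excel inside the loop.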

pandas creates 2 copies of files in a loop

I have a dataframe like as below
import numpy as np
import pandas as pd
from numpy.random import default_rng
rng = default_rng(100)
df = pd.DataFrame({'Id':[1,2,3,4,5],
                   'Customer': rng.choice(list('ACD'),size=(5)),
                   'region': rng.choice(list('PQRS'),size=(5)),
                   'dumeel': rng.choice(list('QWER'),size=(5)),
                   'dumma': rng.choice((1234),size=(5)),
                   'target': rng.choice([0,1],size=(5))
                   })
I am trying to split the dataframe based on Customer and store each part in a folder. It is not necessary to understand the full code; the issue is in the last lines.
i = 0
for k, v in df.groupby(['Customer']):
    print(k.split('#')[0])
    LM = k.split('#')[0]
    i = i+1
    unique_cust_names = '_'.join(v['Customer'].astype(str).unique())
    unique_ids = '_'.join(v['Id'].astype(str).unique())
    unique_location = '_'.join(v['dumeel'].astype(str).unique())
    filename = '_'.join([unique_ids, unique_cust_names, unique_location, LM])
    print(filename)
    with pd.ExcelWriter(f"{filename}.xlsx", engine='xlsxwriter') as writer:
        v.to_excel(writer,columns=col_list,index=False)
    wb = load_workbook(filename = 'format_sheet.xlsx')
    sheet_from =wb.worksheets[0]
    wb1 = load_workbook(filename = f"{filename}.xlsx")
    sheet_to = wb1.worksheets[0]
    copy_styles(sheet_from, sheet_to)
    #wb.close()
    tab = Table(displayName = "Table1", ref = "A1:" + get_column_letter(wb1.worksheets[0].max_column) + str(wb1.worksheets[0].max_row) )
    style = TableStyleInfo(name="TableStyleMedium2", showFirstColumn=False, showLastColumn=False, showRowStripes=True, showColumnStripes=False)
    tab.tableStyleInfo = style
    wb1.worksheets[0].add_table(tab)
    #wb1.worksheets[0].parent.save(f"{filename}.xlsx")
    wb1.save("test_files/" + f"{filename}.xlsx") # issue is here
    wb1.close()
print("Total number of customers to be emailed is ", i)
Though the code works fine, I guess the issue is in the line below
wb1.save("test_files/" + f"{filename}.xlsx") # issue is here
This creates two copies of each file: one in the current folder (where the Jupyter notebook is) and another inside the test_files folder.
For example, I see two files called test1.xlsx: one in the current folder and one inside the test_files folder (path is test_files/test1.xlsx).
How can I avoid this?
I expect my output to generate/save only one file for each customer, inside the test_files folder.

The issue is happening because you are referencing two different file names: one with the prefix "test_files/" and one without it. pd.ExcelWriter writes the first file to the current folder, and wb1.save then writes a second copy inside test_files. The best way to handle it is to define the file name once, as follows
dir_filename = "test_files/" + f"{filename}.xlsx"
and then reference it in the following places
with pd.ExcelWriter(dir_filename, engine='xlsxwriter') as writer:
    v.to_excel(writer,columns=col_list,index=False)
##
wb1 = load_workbook(filename = dir_filename)
##
wb1.save(dir_filename)
Hope it helps
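The idea of building the target path once and reusing it everywhere can be sketched with pathlib. This is an illustration only: write_report is a hypothetical helper, a temporary directory stands in for the notebook's folder, and a plain text write stands in for the ExcelWriter / load_workbook / save steps.

```python
import tempfile
from pathlib import Path

def write_report(out_dir, filename, text):
    """Create out_dir if needed and write exactly one file, returning its path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / f"{filename}.xlsx"  # the single place the path is defined
    target.write_text(text)  # stand-in for the write / reload / save steps
    return target

with tempfile.TemporaryDirectory() as tmp:
    write_report(Path(tmp) / "test_files", "test1", "placeholder")
    # List every file created under the temporary folder
    created = sorted(
        p.relative_to(tmp).as_posix() for p in Path(tmp).rglob("*") if p.is_file()
    )
    print(created)  # ['test_files/test1.xlsx']
```

Because every step uses the same target path, only one file appears, and it sits inside test_files.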

How to save transformed file into new excel file using Openpyxl Python?

I have 3 Excel files in my working directory. All 3 files have names ending with "_Updated.xlsx". I want to transform the files so that all empty rows in each file get deleted. I have created functions for this, but the only issue is that I cannot save all the transformed files using the code below. I am not sure what is wrong. The reason for creating new files is that I would like to keep my raw files.
Python code
import openpyxl
import os
from openpyxl import load_workbook,Workbook
import glob
from pathlib import Path
Excel_file_path="/Excel"
for file in Path(Excel_file_path).glob('*_Updated.xlsx'):
    wb=load_workbook(file)
    wb_modified = False
    for sheet in wb.worksheets:
        max_row_in_sheet = sheet.max_row
        max_col_in_sheet = sheet.max_column
        sheet_modified = False
        if max_row_in_sheet > 1:
            first_nonempty_row = nonempty_row() # Function to find the first nonempty row
            sheet_modified = del_rows_before(first_nonempty_row) # Function to delete the rows before it
        wb_modified = wb_modified or sheet_modified
    if wb_modified:
        for workbook in workbooks:
            for sheet in wb.worksheets:
                new_wb = Workbook()
                ws = new_wb.active
                for row_data in sheet.iter_rows():
                    for row_cell in row_data:
                        ws[row_cell.coordinate].value = row_cell.value
                new_wb.save("/Excel/"+sheet.title+"_Transformed.xlsx")

In case anyone is still looking for an answer to my question above, below is the code that worked for me. It saves the modified workbook itself under a new name instead of copying cells into a new workbook.
import openpyxl
import os
from openpyxl import load_workbook
import glob
from pathlib import Path
Excel_file_path="/Excel"
for file in Path(Excel_file_path).glob('*_Updated.xlsx'):
    wb=load_workbook(file)
    wb_modified = False
    for sheet in wb.worksheets:
        max_row_in_sheet = sheet.max_row
        max_col_in_sheet = sheet.max_column
        sheet_modified = False
        if max_row_in_sheet > 1:
            first_nonempty_row = get_first_nonempty_row() # Function to find the first nonempty row
            sheet_modified = del_rows_before(first_nonempty_row) # Function to delete the rows before it
    file_name = os.path.basename(file)
    # file_name[:-5] strips the ".xlsx" extension
    wb.save("Excel/"+file_name[:-5]+"_Transformed.xlsx")
    wb.close()
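The helper that finds the first nonempty row is not shown in the question or the answer. A minimal sketch of what such a helper might look like, working on plain tuples of cell values so it runs without openpyxl. first_nonempty_row_index is a hypothetical name, and treating both None and the empty string as empty is an assumption:

```python
def first_nonempty_row_index(rows):
    """Return the 1-based index of the first row containing any value, or None."""
    for index, row in enumerate(rows, start=1):
        # A cell counts as empty if it is None or an empty string
        if any(cell not in (None, "") for cell in row):
            return index
    return None

rows = [
    (None, None, None),
    ("", None, None),
    ("Name", "Qty", None),
    ("apple", 3, None),
]
first = first_nonempty_row_index(rows)
print(first)      # 3
print(first - 1)  # 2 leading empty rows
```

With openpyxl, the rows could presumably come from sheet.iter_rows(values_only=True), and the leading rows could then be removed with sheet.delete_rows(1, first - 1).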

Python/Pandas: Iterate over Excel files and extract information

I found threads on extracting info from various sheets of the same file, and solutions to problems similar to, but not exactly like, mine.
I have several Excel workbooks, each containing several sheets. I would like to iterate over each workbook and extract information from a sheet named "3. Prices". This sheet is available in all files. There are two pieces of information to extract from this sheet in every file. The first is always found in cell range E13:H13 and the second in cells F19, I19 and K19.
I would like to place the two pieces of extracted information next to one another (for a given file) and then stack the extracts from every file on top of each other in one master file. Also, the first column of the combined file should be the file name.
What I've tried so far, with no luck:
from openpyxl import load_workbook
import os
import pandas as pd
directory = os.listdir('C:\\User\\files')
for file in directory:
    if os.path.isfile(file):
        file_name = file[0:3]
        workbook = load_workbook(filename = file)
        sheet = workbook['3. Prices']
        e13 = sheet['E13'].value
        f13 = sheet['F13'].value
        g13 = sheet['G13'].value
        h13 = sheet['H13'].value
        f19 = sheet['F19'].value
        i19 = sheet['I19'].value
        k19 = sheet['K19'].value
        df = df.append(pd.DataFrame({
            "File_name":file_name,
            "E13":e13, "F13":f13, "G13":g13,"H13":h13,
            "F19":f19,"I19":i19,"K19":k19,
        }, index=[0]))

I figured it out. I was missing two elements: 1) changing the current working directory to match the one in the variable 'directory', and 2) defining a dataframe at the start.
from openpyxl import load_workbook
import os
import pandas as pd
os.chdir('C:\\User\\files')
directory = os.listdir('C:\\User\\files')
df=pd.DataFrame()
for file in directory:
    if os.path.isfile(file):
        file_name = file[0:3]
        # data_only=True reads the cached values of formula cells
        workbook = load_workbook(filename = file, data_only=True)
        sheet = workbook['3. Prices']
        e13 = sheet['E13'].value
        f13 = sheet['F13'].value
        g13 = sheet['G13'].value
        h13 = sheet['H13'].value
        f19 = sheet['F19'].value
        i19 = sheet['I19'].value
        k19 = sheet['K19'].value
        # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
        df = pd.concat([df, pd.DataFrame({
            "File_name":file_name,
            "E13":e13, "F13":f13, "G13":g13,"H13":h13,
            "F19":f19,"I19":i19,"K19":k19,
        }, index=[0])], ignore_index=True)

How to read specific sheets from My XLS file in Python

As of now I can read all the sheets of the Excel file.
e.msgbox("select Excel File")
updated_deleted_xls = e.fileopenbox()
book = xlrd.open_workbook(updated_deleted_xls, formatting_info=True)
openfile = e.fileopenbox()
for sheet in book.sheets():
    for row in range(sheet.nrows):
        for col in range(sheet.ncols):
            thecell = sheet.cell(row, 0)
            xfx = sheet.cell_xf_index(row, 0)
            xf = book.xf_list[xfx]

If you open your editor from the desktop or command line, you have to specify the full file path when reading the file:
import pandas as pd
df = pd.read_excel(r'File path', sheet_name='Sheet name')
Alternatively, if you open your editor in the file's directory, you can read the file directly by name using the pandas library:
import pandas as pd
df = pd.read_excel('KPMG_VI_New_raw_data_update_final.xlsx', sheet_name='Title Sheet')
df1 = pd.read_excel('KPMG_VI_New_raw_data_update_final.xlsx',sheet_name='Transactions')
df2 = pd.read_excel('KPMG_VI_New_raw_data_update_final.xlsx', sheet_name='NewCustomerList')
df3 = pd.read_excel('KPMG_VI_New_raw_data_update_final.xlsx', sheet_name='CustomerDemographic')
df4 = pd.read_excel('KPMG_VI_New_raw_data_update_final.xlsx', sheet_name='CustomerAddress')

Maybe pandas would be helpful (the go-to package for data):
import pandas as pd
df = pd.read_excel('filname.xls', sheet = 0)
Edit: Since a lot of time has passed and pandas has matured, the arguments have changed. So for pandas >1.0.0:
import pandas as pd
df = pd.read_excel('filname.xls', sheet_name = 0)

You can use book.sheet_by_name() to read specific sheets by their name from an xls file.
for name, sheet_name in zip(filename, sheetnumber):
    # formatting_info=True is required for cell_xf_index and xf_list
    book = xlrd.open_workbook(name, formatting_info=True)
    sheet = book.sheet_by_name(sheet_name)
    for row in range(sheet.nrows):
        for column in range(sheet.ncols):
            thecell = sheet.cell(row, 0)
            xfx = sheet.cell_xf_index(row, 0)
            xf = book.xf_list[xfx]
Here filename holds the paths to your xls files and sheetnumber holds the names of the sheets to read; the two are iterated in parallel.
Alternatively, you could use book.sheet_by_index() and pass an index to return a specific sheet.
From the docs:
sheet_by_index(sheetx)
Parameters: sheetx – Sheet index in range(nsheets)
For example:
first_sheet = book.sheet_by_index(0) # returns the first sheet.

You can use either book.sheet_by_name() or book.get_sheet().
Example using get_sheet():
book = xlrd.open_workbook(updated_deleted_xls, formatting_info=True)
sheet = book.get_sheet(0) # Gets the first sheet.
Example using sheet_by_name():
book = xlrd.open_workbook(updated_deleted_xls, formatting_info=True)
sheet_names = book.sheet_names()
xl_sheet = book.sheet_by_name(sheet_names[0])
