I've already read Can Pandas read and modify a single Excel file worksheet (tab) without modifying the rest of the file? but here my question is specific to the layout mentioned hereafter.
How to open an Excel file with Pandas, do some modifications, and save it back:
(1) without removing the Filter on the first row
(2) without modifying the "displayed column width" of the columns as displayed in Excel
(3) without removing the formulas which might be present on some cells
?
Here is what I tried, it's a short example (in reality I do more processing with Pandas):
import pandas as pd
df = pd.read_excel('in.xlsx')
df['AB'] = df['A'].astype(str) + ' ' + df['B'].astype(str) # create a new column from 2 others
del df['Date'] # delete columns
del df['Time']
df.to_excel('out.xlsx', index=False)
With this code, the Filter on the first row is removed and the displayed column widths are reset to the default, which is not very handy (we would have to set the correct width for every column manually).
If you are using a machine that has Excel installed on it, then I highly recommend using the flexible xlwings API. This answers all your questions.
Let's assume I have an Excel file called demo.xlsx in the same directory as my program.
app.py
import xlwings as xw # pip install xlwings
import pandas as pd
wb = xw.Book('demo.xlsx')
This will create a workbook instance and open the file in Excel, so that you can drive it with Python commands.
Let's assume we have the following dataframe that we want to use to replace the ID and Name column:
new_name
A John_new
B Adams_new
C Mo_new
D Safia_new
wb.sheets['Sheet1']['A1:B1'].value = df
Finally, you can save and close.
wb.save()
wb.close()
I would recommend xlwings, as it talks to Excel's COM interface (like the built-in VBA), so it is more powerful. I have never tested whether it preserves filters or formulas; the official docs may describe a way.
For my own use, I just build everything into python, filtering, formulas, so I don't even touch the excel sheet.
Demo:
# [step 0] boilerplate
import numpy as np
import pandas as pd
df = pd.DataFrame(
    index=pd.date_range("2020-01-01 11:11:11", periods=100, freq="min"),
    columns=list('abc'))
df['a'] = np.random.randn(100)  # one random value per row
df['b'] = df['a'] * 2 + 10
# [step 1] google xlwings, and pip/conda install xlwings
# [step 2] open a new excel sheet, no need to save
# (basically this code will indiscriminately wipe whatever sheet is active on your desktop)
# [step 3] magic, ...and things you can do
import xlwings as xw
wb = xw.books.active
ws = wb.sheets.active
ws.range('A1').current_region.options(index=1).value = df
# I believe this preserves existing formatting; HOWEVER, it will destroy filtering
if 1:
    # showcasing some formatting you can do
    active_window = wb.app.api.ActiveWindow
    active_window.FreezePanes = False
    active_window.SplitColumn = 2 # const_splitcolumn
    active_window.SplitRow = 1
    active_window.FreezePanes = True
    ws.cells.api.Font.Name = 'consolas'
    ws.api.Rows(1).Orientation = 60
    ws.api.Columns(1).Font.Bold = True
    ws.api.Columns(1).Font.ColorIndex = 26
    ws.api.Rows(1).Font.Bold = True
    ws.api.Rows(1).Borders.Weight = 4
    ws.autofit('c') # 'c' means columns, autofitting columns
    ws.range(1,1).api.AutoFilter(1)
This is a solution for (1), (2), but not (3) from my original question. (If you have an idea for (3), a comment and/or another answer is welcome).
In this solution, we open the input Excel file two times:
once with openpyxl: this is useful to keep the original layout (which seems totally discarded when reading as a pandas dataframe!)
once as a pandas dataframe df to benefit from pandas' great API to manipulate/modify the data itself. Note: data modification is handier with pandas than with openpyxl because we have vectorization, filtering df[df['foo'] == 'bar'], direct access to the columns by name df['foo'], etc.
The following code modifies the input file and keeps the layout: the first row "Filter" is not removed and the width of each column is not modified.
import pandas as pd
from openpyxl.utils.dataframe import dataframe_to_rows
from openpyxl import load_workbook
wb = load_workbook('test.xlsx') # load as openpyxl workbook; useful to keep the original layout
# which is discarded in the following dataframe
df = pd.read_excel('test.xlsx') # load as dataframe (modifications will be easier with pandas API!)
ws = wb.active
df.iloc[1, 1] = 'hello world' # modify a few things
rows = dataframe_to_rows(df, index=False)
for r_idx, row in enumerate(rows, 1):
    for c_idx, value in enumerate(row, 1):
        ws.cell(row=r_idx, column=c_idx, value=value)
wb.save('test2.xlsx')
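For point (3), one possible extension of the loop above is to skip every cell that currently holds a formula, so that only plain values are overwritten. The following is a minimal sketch, not a tested general solution: the demo workbook, its column names and the temporary paths are made up for illustration, and it assumes the pandas-side changes keep the same rows and columns in the same order (deleting or reordering columns would misalign the formula cells).

```python
import os
import tempfile

import pandas as pd
from openpyxl import Workbook, load_workbook

# Build a small demo workbook (hypothetical data) with a filter, a custom
# column width and a formula column.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "test.xlsx")
dst_path = os.path.join(tmp_dir, "test2.xlsx")

demo = Workbook()
demo_ws = demo.active
demo_ws.append(["A", "B", "Total"])
demo_ws.append([1, 3, "=A2+B2"])
demo_ws.append([2, 4, "=A3+B3"])
demo_ws.auto_filter.ref = "A1:C3"
demo_ws.column_dimensions["A"].width = 30
demo.save(src_path)

wb = load_workbook(src_path)   # default data_only=False keeps formulas as strings
ws = wb.active
df = pd.read_excel(src_path)   # values only, convenient for the pandas work

df["A"] = df["A"] * 10         # some pandas-side modification

# Write the data back from row 2 onwards (row 1 is the header), skipping formulas.
for r_idx, record in enumerate(df.itertuples(index=False, name=None), start=2):
    for c_idx, value in enumerate(record, start=1):
        cell = ws.cell(row=r_idx, column=c_idx)
        if cell.data_type == "f":   # the cell holds a formula: leave it alone
            continue
        cell.value = value

wb.save(dst_path)
result_ws = load_workbook(dst_path).active
```

After the round trip, `result_ws` still carries the filter range, the width of column A and the formula strings in the Total column, while column A holds the modified values.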
I think this is outside the scope of pandas; you need openpyxl to take care of all the formatting, blocked rows, named ranges and so on. The main difference is that you cannot use vectorized computation as in pandas, so you need to write some loops.
Magicians out there....
I need your help with the best approaches for the below use case.
I have an Excel sheet with lakhs of rows of data, and I need to filter it based on some criteria and create multiple new files.
I am in no mood to do it manually, so I started working on a Python script that will do the job for me. However, I am facing a few challenges in achieving the end goal: the "Color Formatting" and the "comment" added to the cells.
Let's recreate the scenario. I have attached a sample excel sheet for your reference here.
it includes "Indian Cars" data with 4 headers called (Brand, Model, Fuel Type & Transmission Type). I need to filter the data based on "Brand" and create a new excel file (workbook) with the Brand name as the excel file name.
Approach 1:-
First I loaded the Excel sheet into a data frame with Pandas, then filtered the data and exported it. That was quite fast and easy and I loved it. However, I am losing the cell colors and the notes added to the cells (Model & Fuel Type).
Note: I tried styling the pandas, however, for some reason, it's not working for me.
Approach 2:-
I thought of using Openpyxl & Xlsxwriter; however, the issue is that I am unable to filter the data and keep the comments added to the header.
Approach 3:-
Logically, I can create a copy of my existing sheet and delete the unwanted rows from it and save it with desired name, that should do the job for me. Unfortunately, I am unable to figure out how to achieve it in python.
Kindly share your thoughts on this and help me with right approach... and If I can get a sample code or full code... that would just make my day... :D
This should do the trick. You can change the colors of the headers.
Code for custom styling of the excel added.
import pandas as pd
from xlsxwriter.utility import xl_range  # needed for the border range below

# style each row based on a simple condition (an example; change it or add more conditions as needed)
def style_df(row):
    transmission = row['Transmission Type']
    if not pd.isna(transmission):
        if transmission.strip() == 'Manual':
            return ['background-color: gray; color: red'] * len(row)
        elif transmission.strip() == 'Manual, Automatic':
            return ['background-color: lightblue; color: green'] * len(row)
    return [''] * len(row)

page = pd.read_excel("Cars_in_india.xlsx", 'Cars in India')

# create one Excel file per brand
for brand in page.Brand.unique():
    subset = page[page.Brand == brand].copy()
    # the context manager saves the file on exit (writer.save() was removed in pandas 2.0)
    with pd.ExcelWriter(brand + ".xlsx", engine='xlsxwriter') as writer:
        workbook = writer.book
        border_fmt = workbook.add_format({'bottom': 1, 'top': 1, 'left': 1, 'right': 1})
        subset.style.apply(style_df, axis=1).to_excel(writer, index=False, sheet_name=brand)
        worksheet = writer.sheets[brand]
        # size each column to its longest value (or its header, if that is longer)
        for col_idx, column in enumerate(subset.columns):
            column_width = max(subset[column].astype(str).map(len).max(), len(column))
            worksheet.set_column(col_idx, col_idx, column_width)
        # apply a style to the header cells
        worksheet.write(0, 1, "Model", workbook.add_format({'fg_color': '#00FF00'}))
        worksheet.write(0, 2, "Fuel Type", workbook.add_format({'fg_color': '#FFA500'}))
        # apply borders to the whole table
        worksheet.conditional_format(
            xl_range(0, 0, len(subset), len(subset.columns) - 1),
            {'type': 'no_errors', 'format': border_fmt})
You can use openpyxl to read the comments and then write them back when creating the new Excel file. However, your file uses a type of comment that is not compatible with the Excel version openpyxl supports (you will see the same error in the Google cloud editor). So the only option is to change the type of the comments or to rewrite them in the Python code.
Example code:
from openpyxl import load_workbook
wb = load_workbook("Cars_in_india.xlsx")
ws = wb["Cars in India"]
_, model_cell, fuel_cell, _ = list(ws.rows)[0]  # the four header cells of the first row
# then after this code:
# worksheet.write(0, 1, "Model", workbook.add_format({'fg_color': '#00FF00'}))
# worksheet.write(0, 2, "Fuel Type", workbook.add_format({'fg_color': '#FFA500'}))
# you can add (the comment text lives on cell.comment, not on the cell itself):
worksheet.write_comment('B1', model_cell.comment.text)
worksheet.write_comment('C1', fuel_cell.comment.text)
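Approach 3 from the question (copy the existing sheet, delete the unwanted rows, save under the brand name) can also be sketched with openpyxl alone, which keeps the header fills and comments because the workbook is never rebuilt from a DataFrame. This is only a sketch under assumptions: `split_by_brand`, the demo data and the temporary paths are invented for illustration, the comments are assumed to be of a kind openpyxl can read, and deleting rows one at a time may be slow on sheets with lakhs of rows.

```python
import os
import tempfile

from openpyxl import Workbook, load_workbook
from openpyxl.comments import Comment
from openpyxl.styles import PatternFill


def split_by_brand(src_path, out_dir, brand_col=1):
    """Save one copy of the workbook per brand, keeping only that brand's rows."""
    source_rows = load_workbook(src_path).active.iter_rows(min_row=2, values_only=True)
    brands = {row[brand_col - 1] for row in source_rows if row[brand_col - 1] is not None}
    written = {}
    for brand in sorted(brands):
        wb = load_workbook(src_path)   # fresh copy, styles and comments included
        ws = wb.active
        # Walk bottom-up so that deleting a row does not shift rows still to be visited.
        for row_idx in range(ws.max_row, 1, -1):
            if ws.cell(row=row_idx, column=brand_col).value != brand:
                ws.delete_rows(row_idx)
        out_path = os.path.join(out_dir, f"{brand}.xlsx")
        wb.save(out_path)
        written[brand] = out_path
    return written


# Demo workbook (hypothetical data) with a colored, commented header cell.
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "cars.xlsx")
demo = Workbook()
demo_ws = demo.active
demo_ws.append(["Brand", "Model", "Fuel Type", "Transmission Type"])
demo_ws.append(["Tata", "Nexon", "Petrol", "Manual"])
demo_ws.append(["Maruti", "Swift", "Petrol", "Manual"])
demo_ws.append(["Tata", "Tiago", "CNG", "Manual"])
demo_ws["B1"].fill = PatternFill(fill_type="solid", start_color="00FF00")
demo_ws["B1"].comment = Comment("Model note", "author")
demo.save(src)

files = split_by_brand(src, tmp_dir)
tata_ws = load_workbook(files["Tata"]).active
```

Each output file starts as a full copy of the source, so anything openpyxl manages to load (fills, fonts, column widths, readable comments) is carried over without being restated in code.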
I have a very large excel file that I'm dealing with in python. I have a column where every cell is a different formula. I want to copy the formulas and paste them one column over from column GD to GE.
The issue is that I want the formulas to update like they do in Excel; it's just that Excel takes a very long time to copy/paste because the file I'm working with is very large.
Any ideas on possibly how to use openpyxl's translator to do this or anything else?
from openpyxl import load_workbook
import pandas as pd
#loads the excel file and is now saved under workbook#
workbook = load_workbook('file.xlsx')
#uses the individual sheets index(first sheet = 0) to work on one sheet at a time#
sheet= workbook.worksheets[8]
#inserts a column at specified index number#
sheet.insert_cols(187)
#naming the new columns#
sheet['GE2']= '20220531'
here is my updated code
from openpyxl import load_workbook
from openpyxl.formula.translate import Translator
#loads the excel file and is now saved under workbook#
workbook = load_workbook('file.xlsx')
#uses the individual sheets index(first sheet = 0) to work on one sheet at a time#
sheet= workbook.worksheets[8]
formula = sheet['GD3'].value
new_formula = Translator(formula, origin= 'GE3').translate_formula("GD3")
sheet['GD2'] = new_formula
for row in sheet.iter_rows(min_col=187, max_col=188):
    old, new = row
    if new.data_type != "f":
        continue
    new_formula = Translator(new.value, origin=old.coordinate).translate_formula(new.coordinate)
workbook.save('file.xlsx')
When you add or remove columns and rows, Openpyxl does not manage formulae for you. The reason for this is simple: where should it stop? Managing a "dependency graph" is exactly the kind of functionality that an application like MS Excel provides.
But it is quite easy to do this in your own code using the Formula Translator
# insert the column
formula = ws['GE1'].value
new_formula = Translator(formula, origin="GD1").translate_formula("GE1")
ws['GE1'] = new_formula
It should be fairly straightforward to create a loop for this (check the data type and use cell.coordinate to avoid potential typos or incorrect adjustments):
from openpyxl.formula.translate import Translator
ws.insert_cols(187)
for row in ws.iter_rows(min_col=187, max_col=188):
    old, new = row
    if new.data_type != "f":
        continue
    new_formula = Translator(new.value, origin=old.coordinate).translate_formula(new.coordinate)
    new.value = new_formula  # write the translated formula back to the cell
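For the copy/paste case described in the question (taking every formula in one column and pasting it one column over, with relative references shifted as Excel would), a small self-contained sketch might look like the following. The helper name `copy_formula_column` and the tiny in-memory workbook are assumptions made for illustration; on the real file the source and destination would be columns 186 (GD) and 187 (GE).

```python
from openpyxl import Workbook
from openpyxl.formula.translate import Translator


def copy_formula_column(ws, src_col, dst_col, min_row=1):
    """Copy every formula in src_col to dst_col, shifting relative references."""
    for (src,) in ws.iter_rows(min_col=src_col, max_col=src_col, min_row=min_row):
        if src.data_type != "f":   # only formulas are translated
            continue
        dst = ws.cell(row=src.row, column=dst_col)
        dst.value = Translator(src.value, origin=src.coordinate).translate_formula(dst.coordinate)


# Tiny in-memory workbook standing in for the large file.
wb = Workbook()
ws = wb.active
ws["A1"] = 1
ws["A2"] = 2
ws["B1"] = "=A1*2"
ws["B2"] = "=$A$2+A1"

copy_formula_column(ws, src_col=2, dst_col=3)   # column B to column C
```

Relative references move with the formula while absolute ones (`$A$2`) stay put, which is the same behavior as a copy/paste in Excel.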
I'm trying to copy/append a dataframe with multiple column headers(similar to the one below) to an existing excel sheet starting from a particular cell AA2
df1 = pd.DataFrame({'sub1': [np.nan,'E',np.nan,'S'],
'sub2': [np.nan,'D',np.nan,'A']})
df2 = pd.DataFrame({'sub1': [np.nan,'D',np.nan,'S'],
'sub2': [np.nan,'C',np.nan,'S']})
df = pd.concat({'Af':df1, 'Dp':df2}, axis=1)
df
I'm thinking of a solution to export this dataframe to an excel starting in that particular cell and use openpyxl to copy the data from one to another - column by column... but not sure if that is the correct approach. any ideas?!
(the excel sheet that I'm working with has formatting and can't make it into a dataframe and use merge)
I've had success manipulating Excel files in the past with xlsxwriter (you will need to pip install this as a dependency first - although it does not need to be explicitly imported).
import io
import pandas as pd
# Load your file here instead
file_bytes = io.BytesIO()
with pd.ExcelWriter(file_bytes, engine = 'xlsxwriter') as writer:
    # Write a DataFrame to Excel into specific cells
    pd.DataFrame().to_excel(
        writer,
        sheet_name = 'test_sheet',
        startrow = 10, startcol = 5,
        index = False
    )
    # Note: You can repeat any of these operations within the context manager
    # and keep adding stuff...
    # Add some text to cells as well:
    writer.sheets['test_sheet'].write('A1', 'Your text goes here')
file_bytes.seek(0)
# Then write your bytes to a file...
# Overwriting it in your case?
Bonus:
You can add plots too: write them to a BytesIO object, call <your_image_bytes>.seek(0), and then pass the object to the insert_image() function.
... # still inside ExcelWriter context manager
plot_bytes = io.BytesIO()
# Create plot in matplotlib here
plt.savefig(plot_bytes, format='png') # Instead of plt.show()
plot_bytes.seek(0)
writer.sheets['test_sheet'].insert_image(
5, # Row start
5, # Col start
'some_image_name.png',
options = {'image_data': plot_bytes}
)
The full documentation is really helpful too:
https://xlsxwriter.readthedocs.io/working_with_pandas.html
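Since xlsxwriter only creates new files, one alternative for an existing, formatted sheet is to place the values cell by cell with openpyxl, which leaves the styling of the target cells untouched. The sketch below is an assumption-laden illustration rather than an established recipe: `write_frame_at` is a made-up helper, the header levels are written on consecutive rows without merging, and NaN cells are simply skipped. A fresh `Workbook()` stands in for the existing file, which in practice would come from `load_workbook(...)`.

```python
import numpy as np
import pandas as pd
from openpyxl import Workbook
from openpyxl.utils import column_index_from_string
from openpyxl.utils.cell import coordinate_from_string


def write_frame_at(ws, df, anchor):
    """Write df (headers, then values) into ws with its top-left corner at anchor."""
    col_letter, top_row = coordinate_from_string(anchor)
    left_col = column_index_from_string(col_letter)
    n_levels = df.columns.nlevels
    # One header row per column level.
    for j, labels in enumerate(df.columns):
        if not isinstance(labels, tuple):
            labels = (labels,)
        for level, label in enumerate(labels):
            ws.cell(row=top_row + level, column=left_col + j, value=label)
    # Data rows start right below the header rows; missing values are skipped.
    for i, record in enumerate(df.itertuples(index=False, name=None)):
        for j, value in enumerate(record):
            if pd.isna(value):
                continue
            ws.cell(row=top_row + n_levels + i, column=left_col + j, value=value)


df1 = pd.DataFrame({'sub1': [np.nan, 'E', np.nan, 'S'],
                    'sub2': [np.nan, 'D', np.nan, 'A']})
df2 = pd.DataFrame({'sub1': [np.nan, 'D', np.nan, 'S'],
                    'sub2': [np.nan, 'C', np.nan, 'S']})
df = pd.concat({'Af': df1, 'Dp': df2}, axis=1)

wb = Workbook()          # stand-in for load_workbook('existing.xlsx')
ws = wb.active
ws["A1"] = "existing content"
write_frame_at(ws, df, "AA2")
```

Only the `value` of each target cell is set, so fonts, fills and borders already present in the AA2 region should survive a subsequent `wb.save(...)`.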
I'll try to explain my problem with an example:
Let's say I have an Excel file test.xlsx which has five tabs (aka worksheets): Sheet1, Sheet2, Sheet3, Sheet4 and sheet5. I am interested to read and modify data in sheet2.
My sheet2 has some columns whose cells are dropdowns and those dropdown values are defined in sheet4 and sheet5. I don't want to touch sheet4 and sheet5. (I mean sheet4 & sheet5 have some references to cells on Sheet2).
I know that I can read all the sheets in the Excel file using pd.read_excel('test.xlsx', sheet_name=None), which gives all sheets as a dictionary (OrderedDict) of DataFrames.
Now I want to modify my sheet2 and save it without disturbing the others. Is it possible to do this using the Python Pandas library?
[UPDATE - 4/1/2019]
I am using Pandas read_excel to read whatever sheet I need from my excel file, validating the data with the data in database and updating the status column in the excelfile.
So for writing back the status column in excel I am using openpyxl as shown in the below pseudo code.
import pandas as pd
import openpyxl
df = pd.read_excel(input_file, sheetname=my_sheet_name)
df = df.where((pd.notnull(df)), None)
write_data = {}
# Doing some validations with the data and building my write_data with key
# as (row_number, column_number) and value as actual value to put in that
# cell.
at the end my write_data looks something like this:
{(2,1): 'Hi', (2,2): 'Hello'}
Now I have defined a separate class named WriteData for writing data using openpyxl
# WriteData(input_file, sheet_name, write_data)
book = openpyxl.load_workbook(input_file, data_only=True, keep_vba=True)
sheet = book.get_sheet_by_name(sheet_name)
for k, v in write_data.items():
    row_num, col_num = k
    sheet.cell(row=row_num, column=col_num).value = v
book.save(input_file)
Now when I am doing this operation it is removing all the formulas and diagrams. I am using openpyxl 2.6.2
Please correct me if I am doing anything wrong! Is there any better way to do?
Any help on this will be greatly appreciated :)
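To modify a single sheet at a time, you can use the pandas ExcelWriter in append mode (the default mode, "w", rewrites the whole file and drops the other sheets):
sheet2 = pd.read_excel("test.xlsx", sheet_name="sheet2")
## modify sheet2 as needed... then save it back:
# mode="a" with if_sheet_exists="replace" needs the openpyxl engine and pandas 1.3 or later
with pd.ExcelWriter("test.xlsx", mode="a", engine="openpyxl", if_sheet_exists="replace") as writer:
    sheet2.to_excel(writer, sheet_name="sheet2", index=False)  # index=False avoids adding an extra index column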
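Regarding the openpyxl code in the question: loading with `data_only=True` makes openpyxl return the cached results instead of the formulas, so saving that workbook writes the values back and the formulas are gone. A minimal sketch of the same cell-writing loop with the default `data_only=False` follows; the demo workbook, sheet names and temporary path are invented for illustration. Diagrams are a separate matter: the openpyxl documentation warns that it does not read every item in a file (shapes, for example, are lost on re-save), so this sketch only claims to keep formulas.

```python
import os
import tempfile

from openpyxl import Workbook, load_workbook

# Demo file (hypothetical): Sheet2 holds data plus a formula, Sheet4 refers to Sheet2.
path = os.path.join(tempfile.mkdtemp(), "test.xlsx")
demo = Workbook()
demo_sheet2 = demo.active
demo_sheet2.title = "Sheet2"
demo_sheet2["A1"] = 1
demo_sheet2["A2"] = 2
demo_sheet2["A3"] = "=SUM(A1:A2)"
demo.create_sheet("Sheet4")["A1"] = "=Sheet2!A3"
demo.save(path)

# Same structure as in the question: (row, column) -> value to write.
write_data = {(1, 2): "Hi", (2, 2): "Hello"}

book = load_workbook(path)   # default data_only=False: formulas stay formulas
sheet = book["Sheet2"]       # book[name] replaces the deprecated get_sheet_by_name()
for (row_num, col_num), value in write_data.items():
    sheet.cell(row=row_num, column=col_num).value = value
book.save(path)

check = load_workbook(path)
```

After saving, `check` still contains the formula strings on both sheets, alongside the newly written status values.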
Right now I'm working on combining Excel sheets into 1 new sheet, using pandas which is working.
The only problem is that the value inside the new Excel sheet are plain numbers instead of the Formulas and I would like the Formulas.
Loading file
directory = os.path.dirname(__file__)
fname = os.path.join(directory, "Reisanalyze.xlsm")
print("Loading %s..." % fname)
sheet1 = pd.read_excel(fname, sheet_name="Input")  # the keyword is sheet_name in current pandas
sheet2 = pd.read_excel(fname, sheet_name="Alternatieven")
Write to new sheet
writer = pd.ExcelWriter('first_sheet.xlsx', engine='xlsxwriter')
sheet1.to_excel(writer, sheet_name='Input', merge_cells=False, startrow=0, startcol=0)
sheet2.to_excel(writer, sheet_name='Input', merge_cells=False, startrow=0, startcol=21)
I originally tried to use the pycel project which worked until I needed to load multiple sheets, which didn't work. That's why I'm using pandas to write multiple sheets into 1 sheet.
You can use OpenPyXL. Read here
Following is the test excel file testexl.xlsx
A | B
---------- | ------
=SUM(B1:B2)| 1
| 2
Following is the test code
from openpyxl import load_workbook
import pandas as pd
wb = load_workbook(filename = 'testexl.xlsx')
sheet_names = wb.sheetnames  # get_sheet_names() is deprecated
name = sheet_names[0]
sheet_ranges = wb[name]
df = pd.DataFrame(sheet_ranges.values)
print(df)
Output
0 1
0 =SUM(B1:B2) 1
1 None 2
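To go one step further and place two sheets side by side in a single sheet while keeping the formulas, the cells can be copied with openpyxl and each formula shifted with its formula translator. This is a hedged sketch, not the method of the answer above: `copy_sheet_cells` is a made-up helper, the in-memory workbook and its contents are hypothetical, and formulas that point to other sheets or use named ranges would need extra care.

```python
from openpyxl import Workbook
from openpyxl.formula.translate import Translator


def copy_sheet_cells(src_ws, dst_ws, col_offset=0):
    """Copy values and formulas from src_ws into dst_ws, shifted right by col_offset."""
    for row in src_ws.iter_rows():
        for cell in row:
            if cell.value is None:
                continue
            target = dst_ws.cell(row=cell.row, column=cell.column + col_offset)
            if cell.data_type == "f":
                # Shift relative references by the same offset as the cell itself.
                target.value = Translator(
                    cell.value, origin=cell.coordinate
                ).translate_formula(target.coordinate)
            else:
                target.value = cell.value


# In-memory stand-in for the two source sheets.
wb = Workbook()
input_ws = wb.active
input_ws.title = "Input"
input_ws["A1"] = "=SUM(B1:B2)"
input_ws["B1"] = 1
input_ws["B2"] = 2
alt_ws = wb.create_sheet("Alternatieven")
alt_ws["A1"] = "=B1*2"
alt_ws["B1"] = 5

combined = wb.create_sheet("Combined")
copy_sheet_cells(input_ws, combined, col_offset=0)
copy_sheet_cells(alt_ws, combined, col_offset=21)   # same offset as startcol=21
```

With an offset of 21 columns, the second sheet's column A lands in column V and its references to column B become references to column W.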
If you want to keep excel formulas, then you will need to stop them from being formulas and then convert them back afterwards.
To do this, before conversion, press Control/Command+F on your keyboard; a dialog should come up in the middle of the screen. Click the Replace tab.
In the "Find what:" box type "=" and in the "Replace with:" box type ".=". Then click Replace All.
This will turn the formulas to text for you to copy.
Save it as a csv file
Note: I know that this will also replace = signs inside of formula. It doesn't matter, it'll go.
After you merge them, open it back up in excel, repeat but in reverse to convert them back into formulas.
This might be easier than importing extra modules.