I'm trying to copy/append a dataframe with multiple column headers(similar to the one below) to an existing excel sheet starting from a particular cell AA2
df1 = pd.DataFrame({'sub1': [np.nan,'E',np.nan,'S'],
'sub2': [np.nan,'D',np.nan,'A']})
df2 = pd.DataFrame({'sub1': [np.nan,'D',np.nan,'S'],
'sub2': [np.nan,'C',np.nan,'S']})
df = pd.concat({'Af':df1, 'Dp':df2}, axis=1)
df
I'm thinking of a solution to export this dataframe to an excel starting in that particular cell and use openpyxl to copy the data from one to another - column by column... but not sure if that is the correct approach. any ideas?!
(the excel sheet that I'm working with has formatting and can't make it into a dataframe and use merge)
I've had success manipulating Excel files in the past with xlsxwriter (you will need to pip install this as a dependency first - although it does not need to be explicitly imported).
import io
import pandas as pd
# Load your file here instead
file_bytes = io.BytesIO()
with pd.ExcelWriter(file_bytes, engine = 'xlsxwriter') as writer:
# Write a DataFrame to Excel into specific cells
pd.DataFrame().to_excel(
writer,
sheet_name = 'test_sheet',
startrow = 10, startcol = 5,
index = False
)
# Note: You can repeat any of these operations within the context manager
# and keep adding stuff...
# Add some text to cells as well:
writer.sheets['test_sheet'].write('A1', 'Your text goes here')
file_bytes.seek(0)
# Then write your bytes to a file...
# Overwriting it in your case?
Bonus:
You can add plots too - just write them to a BytesIO object and then call <your_image_bytes>.seek(0) and then use in insert_image() function.
... # still inside ExcelWriter context manager
plot_bytes = io.BytesIO()
# Create plot in matplotlib here
plt.savefig(plot_bytes, format='png') # Instead of plt.show()
plot_bytes.seek(0)
writer.sheets['test_sheet'].insert_image(
5, # Row start
5, # Col start
'some_image_name.png',
options = {'image_data': plot_bytes}
)
The full documentation is really helpful too:
https://xlsxwriter.readthedocs.io/working_with_pandas.html
Related
I've already read Can Pandas read and modify a single Excel file worksheet (tab) without modifying the rest of the file? but here my question is specific to the layout mentioned hereafter.
How to open an Excel file with Pandas, do some modifications, and save it back:
(1) without removing that there is a Filter on the first row
(2) without modifying the "displayed column width" of the columns as displayed in Excel
(3) without removing the formulas which might be present on some cells
?
Here is what I tried, it's a short example (in reality I do more processing with Pandas):
import pandas as pd
df = pd.read_excel('in.xlsx')
df['AB'] = df['A'].astype(str) + ' ' + df['B'].astype(str) # create a new column from 2 others
del df['Date'] # delete columns
del df['Time']
df.to_excel('out.xlsx', index=False)
With this code, the Filter of the first row is removed and the displayed column width are set to a default, which is not very handy (because we would have to manually set the correct width for all columns).
If you are using a machine that has Excel installed on it, then I highly recommend using the flexible xlwings API. This answers all your questions.
Let's assume I have an Excel file called demo.xlxs in the same directory as my program.
app.py
import xlwings as xw # pip install xlwings
import pandas as pd
wb = xw.Book('demo.xlsx')
This will create a initiate an xl workbook instance and open your Excel editor to allow you to invoke Python commands.
Let's assume we have the following dataframe that we want to use to replace the ID and Name column:
new_name
A John_new
B Adams_new
C Mo_new
D Safia_new
wb.sheets['Sheet1']['A1:B1'].value = df
Finally, you can save and close.
wb.save()
wb.close()
I would recommend xlwings, as it interfaces with excel's COM interfaces (like built-in vba), so it is more powerful. I never tested the "preservation of filtering or formula", official doc may provide ways.
For my own use, I just build everything into python, filtering, formulas, so I don't even touch the excel sheet.
Demo:
# [step 0] boiler plate stuff
df = pd.DataFrame(
index=pd.date_range("2020-01-01 11:11:11", periods=100, freq="min"),
columns=list('abc'))
df['a'] = np.random.randn(100, 1)
df['b'] = df['a'] * 2 + 10
# [step 1] google xlwings, and pip/conda install xlwings
# [step 2] open a new excel sheet, no need to save
# (basically this code will indiscriminally wipe whatever sheet that is active on your desktop)
# [step 3] magic, ...and things you can do
import xlwings as xw
wb = xw.books.active
ws = wb.sheets.active
ws.range('A1').current_region.options(index=1).value = df
# I believe this preserves existing formatting, HOWEVER, it will destory filtering
if 1:
# show casing some formatting you can do
active_window = wb.app.api.ActiveWindow
active_window.FreezePanes = False
active_window.SplitColumn = 2 # const_splitcolumn
active_window.SplitRow = 1
active_window.FreezePanes = True
ws.cells.api.Font.Name = 'consolas'
ws.api.Rows(1).Orientation = 60
ws.api.Columns(1).Font.Bold = True
ws.api.Columns(1).Font.ColorIndex = 26
ws.api.Rows(1).Font.Bold = True
ws.api.Rows(1).Borders.Weight = 4
ws.autofit('c') # 'c' means columns, autofitting columns
ws.range(1,1).api.AutoFilter(1)
This is a solution for (1), (2), but not (3) from my original question. (If you have an idea for (3), a comment and/or another answer is welcome).
In this solution, we open the input Excel file two times:
once with openpyxl: this is useful to keep the original layout (which seems totally discarded when reading as a pandas dataframe!)
once as a pandas dataframe df to benefit from pandas' great API to manipulate/modify the data itself. Note: data modification is handier with pandas than with openpyxl because we have vectorization, filtering df[df['foo'] == 'bar'], direct access to the columns by name df['foo'], etc.
The following code modifies the input file and keeps the layout: the first row "Filter" is not removed and the column width of each colum is not modified.
import pandas as pd
from openpyxl.utils.dataframe import dataframe_to_rows
from openpyxl import load_workbook
wb = load_workbook('test.xlsx') # load as openpyxl workbook; useful to keep the original layout
# which is discarded in the following dataframe
df = pd.read_excel('test.xlsx') # load as dataframe (modifications will be easier with pandas API!)
ws = wb.active
df.iloc[1, 1] = 'hello world' # modify a few things
rows = dataframe_to_rows(df, index=False)
for r_idx, row in enumerate(rows, 1):
for c_idx, value in enumerate(row, 1):
ws.cell(row=r_idx, column=c_idx, value=value)
wb.save('test2.xlsx')
I think this is not field of pandas, you must use openpyxl in order to take care of all formatting, blocked_rows, name ranges and so on. Main difference is that you cannot use vectorial computation as in pandas so you need to introduce some loop.
I would like to convert an excel file to a pandas dataframe. All the sheets name have spaces in the name, for instances, ' part 1 of 22, part 2 of 22, and so on. In addition the first column is the same for all the sheets.
I would like to convert this excel file to a unique dataframe. However I dont know what happen with the name in python. I mean I was hable to import them, but i do not know the name of the data frame.
The sheets are imported but i do not know the name of them. After this i would like to use another 'for' and use a pd.merge() in order to create a unique dataframe
for sheet_name in Matrix.sheet_names:
sheet_name = pd.read_excel(Matrix, sheet_name)
print(sheet_name.info())
Using only the code snippet you have shown, each sheet (each DataFrame) will be assigned to the variable sheet_name. Thus, this variable is overwritten on each iteration and you will only have the last sheet as a DataFrame assigned to that variable.
To achieve what you want to do you have to store each sheet, loaded as a DataFrame, somewhere, a list for example. You can then merge or concatenate them, depending on your needs.
Try this:
all_my_sheets = []
for sheet_name in Matrix.sheet_names:
sheet_name = pd.read_excel(Matrix, sheet_name)
all_my_sheets.append(sheet_name)
Or, even better, using list comprehension:
all_my_sheets = [pd.read_excel(Matrix, sheet_name) for sheet_name in Matrix.sheet_names]
You can then concatenate them into one DataFrame like this:
final_df = pd.concat(all_my_sheets, sort=False)
You might consider using the openpyxl package:
from openpyxl import load_workbook
import pandas as pd
wb = load_workbook(filename=file_path, read_only=True)
all_my_sheets = wb.sheetnames
# Assuming your sheets have the same headers and footers
n = 1
for ws in all_my_sheets:
records = []
for row in ws._cells_by_row(min_col=1,
min_row=n,
max_col=ws.max_column,
max_row=n):
rec = [cell.value for cell in row]
records.append(rec)
# Make sure you don't duplicate the header
n = 2
# ------------------------------
# Set the column names
records = records[header_row-1:]
header = records.pop(0)
# Create your df
df = pd.DataFrame(records, columns=header)
It may be easiest to call read_excel() once, and save the contents into a list.
So, the first step would look like this:
dfs = pd.read_excel(["Sheet 1", "Sheet 2", "Sheet 3"])
Note that the sheet names you use in the list should be the same as those in the excel file. Then, if you wanted to vertically concatenate these sheets, you would just call:
final_df = pd.concat(dfs, axis=1)
Note that this solution would result in a final_df that includes column headers from all three sheets. So, ideally they would be the same. It sounds like you want to merge the information, which would be done differently; we can't help you with the merge without more information.
I hope this helps!
I have an Excel file that contains a column with header 'Original Translation'. I also have a DataFrame with column 'Original Translation - {language}', based on the language I am using and some manipulations.
My goal is to open the Excel file, write over the column with header 'Original Translation' with all the data from my DataFrame column 'Original Translation - {language}', preserve the original Excel file formatting, and save to a new output folder.
Here is the code I currently have:
def output_formatted_capstan_file(df, original_file, country, language):
# this is where you generate the file:
# https://stackoverflow.com/questions/20219254/how-to-write-to-an-existing-excel-file-without-overwriting-data-using-pandas
try:
print(original_file)
book = load_workbook(original_file)
writer = pd.ExcelWriter(original_file, engine='openpyxl')
writer.book = book
df.to_excel(writer, ['Original Translation - {}'.format(language)])
writer.save()
except:
print('Failed')
I would approach this using the following method.
1) Import the excel file using a function such as pandas.read_excel, thus taking the data from excel into a dataframe. I'll call this exceldf
2) Merge this dataframe with the data you already have in the Pandas DataFrame. I will call your existing translated dataframe translateddf
3) Reorder the newly merged dataframe newdf and then export out the data. Further options for how to reorder are shown here: re-ordering data frame
4) Export the data to Excel. I'll leave you to integrate it into your initial code. For the generic answer to the question, others may want to look into the integrated Pandas options here to_excel
Example Code
import pandas
# Read in the Excel file
exceldf = pandas.read_excel(open('your_xls_xlsx_filename'), sheetname='Sheet 1')
# Create a new dataframe with your merged data, merging on 'key1'.
# We then drop the column of the original translation, as it should no longer be needed
# I've included the rename argument in case you need it.
newdf = exceldf.merge(translateddf, left_on=['key1'], \
right_on=['key1']) \
.rename(columns={'Original Translation {language}': 'Original Translation {language}'}) \
.drop(['Original Translation'], axis=1)
# Re-order your data.
# Note that if you renamed anything above, you have to update it here too
newdf = newdf[['0', '1', '2', 'Original Translation {language}']]
# An example export, that uses the generic implementation, not your specific code
pandas.newdf.to_excel("output.xlsx")
I have an Excel file (.xls format) with 5 sheets, I want to replace the contents of sheet 5 with contents of my pandas data frame.
From your above needs, you will need to use both Python (to export pandas data frame) and VBA (to delete existing worksheet content and copy/paste external data).
With Python: use the to_csv or to_excel methods. I recommend the to_csv method which performs better with larger datasets.
# DF TO EXCEL
from pandas import ExcelWriter
writer = ExcelWriter('PythonExport.xlsx')
yourdf.to_excel(writer,'Sheet5')
writer.save()
# DF TO CSV
yourdf.to_csv('PythonExport.csv', sep=',')
With VBA: copy and paste source to destination ranges.
Fortunately, in VBA you can call Python scripts using Shell (assuming your OS is Windows).
Sub DataFrameImport()
'RUN PYTHON TO EXPORT DATA FRAME
Shell "C:\pathTo\python.exe fullpathOfPythonScript.py", vbNormalFocus
'CLEAR EXISTING CONTENT
ThisWorkbook.Worksheets(5).Cells.Clear
'COPY AND PASTE TO WORKBOOK
Workbooks("PythonExport").Worksheets(1).Cells.Copy
ThisWorkbook.Worksheets(5).Range("A1").Select
ThisWorkbook.Worksheets(5).Paste
End Sub
Alternatively, you can do vice versa: run a macro (ClearExistingContent) with Python. Be sure your Excel file is a macro-enabled (.xlsm) one with a saved macro to delete Sheet 5 content only. Note: macros cannot be saved with csv files.
import os
import win32com.client
from pandas import ExcelWriter
if os.path.exists("C:\Full Location\To\excelsheet.xlsm"):
xlApp=win32com.client.Dispatch("Excel.Application")
wb = xlApp.Workbooks.Open(Filename="C:\Full Location\To\excelsheet.xlsm")
# MACRO TO CLEAR SHEET 5 CONTENT
xlApp.Run("ClearExistingContent")
wb.Save()
xlApp.Quit()
del xl
# WRITE IN DATA FRAME TO SHEET 5
writer = ExcelWriter('C:\Full Location\To\excelsheet.xlsm')
yourdf.to_excel(writer,'Sheet5')
writer.save()
Or you can do like this:
your_df.to_excel( r'C:\Users\full_path\excel_name.xlsx',
sheet_name= 'your_sheet_name'
)
I tested the previous answers found here: Assuming that we want the other four sheets to remain, the previous answers here did not work, because the other four sheets were deleted. In case we want them to remain use xlwings:
import xlwings as xw
import pandas as pd
filename = "test.xlsx"
df = pd.DataFrame([
("a", 1, 8, 3),
("b", 1, 2, 5),
("c", 3, 4, 6),
], columns=['one', 'two', 'three', "four"])
app = xw.App(visible=False)
wb = xw.Book(filename)
ws = wb.sheets["Sheet5"]
ws.clear()
ws["A1"].options(pd.DataFrame, header=1, index=False, expand='table').value = df
# If formatting of column names and index is needed as xlsxwriter does it,
# the following lines will do it (if the dataframe is not multiindex).
ws["A1"].expand("right").api.Font.Bold = True
ws["A1"].expand("down").api.Font.Bold = True
ws["A1"].expand("right").api.Borders.Weight = 2
ws["A1"].expand("down").api.Borders.Weight = 2
wb.save(filename)
app.quit()
Hello I would like to concatenate three excels files xlsx using python.
I have tried using openpyxl, but I don't know which function could help me to append three worksheet into one.
Do you have any ideas how to do that ?
Thanks a lot
Here's a pandas-based approach. (It's using openpyxl behind the scenes.)
import pandas as pd
# filenames
excel_names = ["xlsx1.xlsx", "xlsx2.xlsx", "xlsx3.xlsx"]
# read them in
excels = [pd.ExcelFile(name) for name in excel_names]
# turn them into dataframes
frames = [x.parse(x.sheet_names[0], header=None,index_col=None) for x in excels]
# delete the first row for all frames except the first
# i.e. remove the header row -- assumes it's the first
frames[1:] = [df[1:] for df in frames[1:]]
# concatenate them..
combined = pd.concat(frames)
# write it out
combined.to_excel("c.xlsx", header=False, index=False)
I'd use xlrd and xlwt. Assuming you literally just need to append these files (rather than doing any real work on them), I'd do something like: Open up a file to write to with xlwt, and then for each of your other three files, loop over the data and add each row to the output file. To get you started:
import xlwt
import xlrd
wkbk = xlwt.Workbook()
outsheet = wkbk.add_sheet('Sheet1')
xlsfiles = [r'C:\foo.xlsx', r'C:\bar.xlsx', r'C:\baz.xlsx']
outrow_idx = 0
for f in xlsfiles:
# This is all untested; essentially just pseudocode for concept!
insheet = xlrd.open_workbook(f).sheets()[0]
for row_idx in xrange(insheet.nrows):
for col_idx in xrange(insheet.ncols):
outsheet.write(outrow_idx, col_idx,
insheet.cell_value(row_idx, col_idx))
outrow_idx += 1
wkbk.save(r'C:\combined.xls')
If your files all have a header line, you probably don't want to repeat that, so you could modify the code above to look more like this:
firstfile = True # Is this the first sheet?
for f in xlsfiles:
insheet = xlrd.open_workbook(f).sheets()[0]
for row_idx in xrange(0 if firstfile else 1, insheet.nrows):
pass # processing; etc
firstfile = False # We're done with the first sheet.
When I combine excel files (mydata1.xlsx, mydata2.xlsx, mydata3.xlsx) for data analysis, here is what I do:
import pandas as pd
import numpy as np
import glob
all_data = pd.DataFrame()
for f in glob.glob('myfolder/mydata*.xlsx'):
df = pd.read_excel(f)
all_data = all_data.append(df, ignore_index=True)
Then, when I want to save it as one file:
writer = pd.ExcelWriter('mycollected_data.xlsx', engine='xlsxwriter')
all_data.to_excel(writer, sheet_name='Sheet1')
writer.save()
Solution with openpyxl only (without a bunch of other dependencies).
This script should take care of merging together an arbitrary number of xlsx documents, whether they have one or multiple sheets. It will preserve the formatting.
There's a function to copy sheets in openpyxl, but it is only from/to the same file. There's also a function insert_rows somewhere, but by itself it won't insert any rows. So I'm afraid we are left to deal (tediously) with one cell at a time.
As much as I dislike using for loops and would rather use something compact and elegant like list comprehension, I don't see how to do that here as this is a side-effect show.
Credit to this answer on copying between workbooks.
#!/usr/bin/env python3
#USAGE
#mergeXLSX.py <a bunch of .xlsx files> ... output.xlsx
#
#where output.xlsx is the unified file
#This works FROM/TO the xlsx format. Libreoffice might help to convert from xls.
#localc --headless --convert-to xlsx somefile.xls
import sys
from copy import copy
from openpyxl import load_workbook,Workbook
def createNewWorkbook(manyWb):
for wb in manyWb:
for sheetName in wb.sheetnames:
o = theOne.create_sheet(sheetName)
safeTitle = o.title
copySheet(wb[sheetName],theOne[safeTitle])
def copySheet(sourceSheet,newSheet):
for row in sourceSheet.rows:
for cell in row:
newCell = newSheet.cell(row=cell.row, column=cell.col_idx,
value= cell.value)
if cell.has_style:
newCell.font = copy(cell.font)
newCell.border = copy(cell.border)
newCell.fill = copy(cell.fill)
newCell.number_format = copy(cell.number_format)
newCell.protection = copy(cell.protection)
newCell.alignment = copy(cell.alignment)
filesInput = sys.argv[1:]
theOneFile = filesInput.pop(-1)
myfriends = [ load_workbook(f) for f in filesInput ]
#try this if you are bored
#myfriends = [ openpyxl.load_workbook(f) for k in range(200) for f in filesInput ]
theOne = Workbook()
del theOne['Sheet'] #We want our new book to be empty. Thanks.
createNewWorkbook(myfriends)
theOne.save(theOneFile)
Tested with openpyxl 2.5.4, python 3.4.
You can simply use pandas and os library to do this.
import pandas as pd
import os
#create an empty dataframe which will have all the combined data
mergedData = pd.DataFrame()
for files in os.listdir():
#make sure you are only reading excel files
if files.endswith('.xlsx'):
data = pd.read_excel(files, index_col=None)
mergedData = mergedData.append(data)
#move the files to other folder so that it does not process multiple times
os.rename(files, 'path to some other folder')
mergedData DF will have all the combined data which you can export in a separate excel or csv file. Same code will work with csv files as well. just replace it in the IF condition
Just to add to p_barill's answer, if you have custom column widths that you need to copy, you can add the following to the bottom of copySheet:
for col in sourceSheet.column_dimensions:
newSheet.column_dimensions[col] = sourceSheet.column_dimensions[col]
I would just post this in a comment on his or her answer but my reputation isn't high enough.