Python/Openpyxl copy and paste columns with formula - python

I have a workbook with a worksheet, named sheet #1. I want to copy some columns from sheet #1 and change column orders a little bit.
First, I create a new worksheet, named sheet #2. I can copy and paste from sheet #1 to sheet #2, but I find openpyxl copies formulas exactly as is, so I have a problem. For example, column O in sheet #1 has some formula like:
O3=(M3*1)+(N3*1)
I move column M in sheet #1 to column H in sheet #2 and move column N in sheet #1 to column I in sheet #2. When I move column O in sheet #1 to column M in sheet #2, I have problems. Firstly, column M in sheet #2's formula is still:
M3=(M3*1)+(N3*1)
I have a circular reference issue since I try to use myself to calculate myself. Secondly, if I move column O in sheet #1 to column J in sheet #2, I don't have this circular reference problem, but my formula is still messed up.
I use the following way to copy and paste:
for i in range(0, 1000):
    sheet_#2.cell(row=i,column=12).value = sheet_#1.cell(row=i,column=14).value
I have tried data_only with true and false when I call load_workbook as follows.
my_workbook = openpyxl.load_workbook(args.input_file, data_only=False)
Neither works for me. True gets me all zeros in both sheet #1 and sheet #2. False gets me the circular reference problem as described above.
Is there a way to use openpyxl package to solve my problem? I think as long as when copying and pasting, if worksheet name can be added to specify the cells in the formula, my problem is solved, something like this:
M3=("Sheet #1"M3*1)+("Sheet #1"N3*1)
If openpyxl doesn't do the job, is there a better package to solve this problem? pandas?

I will start off by saying I am no expert, but I'll give it a go.
By the sound of your question, you may not be familiar with Pandas. I would tackle this with Pandas. It is also worth doing some additional reading on Pandas; it is so powerful, especially for Excel automation.
import pandas as pd
# Read the excel sheets to Pandas DataFrames
DataFrame1 = pd.read_excel("FileName.xlsx", sheet_name='sheet_number_1')
DataFrame2 = pd.read_excel("FileName.xlsx", sheet_name='sheet_number_2')
You should read your sheet #2 DataFrame and bring over columns that DO NOT have formulas from your sheet#1 DataFrame first.
# You can set columns equal to each other like this.
DataFrame2['sheet_2_column_name'] = DataFrame1['sheet_1_column_name']
This will bring over all data from whatever sheet 1 column you choose into whatever sheet 2 column you choose.
Now for the columns with formulas. You mentioned that the formula (M3*1)+(N3*1) should become (H3*1)+(I3*1) in your sheet #2. I wouldn't bring these columns over with the method above; instead I would do something like this:
# apply the formula down each row in a column
DataFrame2['column_name_to_insert_formula_to'] = DataFrame2.apply(lambda row: '=(H{0}*1)+(I{0}*1)'.format(row.name + 2), axis=1)
Both {0} placeholders receive the single argument passed to .format(), which here is row.name + 2. row.name is the zero-based DataFrame index, and adding 2 turns it into the Excel row number, because row 1 holds the header. The leading = is there so that the string is treated as a formula when the DataFrame is written back to a workbook. We use axis=1 because the function should be applied to each row.
More info on Pandas .apply function from the Pandas docs
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
More info on Apply and Lambda usage in Pandas
https://towardsdatascience.com/apply-and-lambda-usage-in-pandas-b13a1ea037f7
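If you would rather stay in openpyxl, another option is to rewrite the cell references inside each formula string while copying it. The following is only a minimal sketch under some assumptions: remap_formula and the old_to_new mapping are hypothetical names, the regular expression only understands simple A1-style references (optionally with $), and it ignores sheet-qualified references and string literals that happen to look like cell addresses.

```python
import re

# Matches simple A1-style references such as M3 or $N$12.
# Assumption: no sheet-qualified references and no string literals
# that look like cell addresses appear in the formula.
CELL_REF = re.compile(r"(?<![A-Za-z0-9_.])(\$?)([A-Z]{1,3})(\$?)(\d+)(?![\d(])")


def remap_formula(formula, column_map):
    """Return formula with column letters replaced according to column_map."""
    def swap(match):
        abs_col, col, abs_row, row = match.groups()
        # Columns that are not in the map are left unchanged.
        return f"{abs_col}{column_map.get(col, col)}{abs_row}{row}"
    return CELL_REF.sub(swap, formula)


# Old column letter in sheet #1 -> new column letter in sheet #2.
old_to_new = {"M": "H", "N": "I", "O": "M"}

new_formula = remap_formula("=(M3*1)+(N3*1)", old_to_new)
print(new_formula)  # =(H3*1)+(I3*1)
```

While copying, you would apply remap_formula only to cell values that are strings starting with "=" and copy every other value unchanged.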

Related

How to replace the blank cells in Excel with 0 using Python?

I'm trying to replace the blank cells in excel to 0 using Python. I have a loop script that will check 2 excel files with the same WorkSheet, Column Headers and Values. Now, from the picture attached,
the script will write the count to Column Count in Excel 2 if the value of Column A in Excel 2 matches the value of Column A in Excel 1. The problem is that for values in Column A of Excel 2 that have no match in Column A of Excel 1, the Column Count cell in Excel 2 is left blank.
Below is a part of the script that will check the 2 excel files I have. I'm trying the suggestion from this link Pandas: replace empty cell to 0 but it doesn't work for me and I get result.fillna(0, inplace=True) NameError: name 'result' is not defined error message. Guidance on how to achieve my goal would be very nice. Thank you in advance.
import pandas as pd
import os
import openpyxl
daily_data = openpyxl.load_workbook('C:/Test.xlsx')
master_data = openpyxl.load_workbook('C:/Source.xlsx')
daily_sheet = daily_data['WorkSheet']
master_sheet = master_data['WorkSheet']
for i in daily_sheet.iter_rows():
    Column_A = i[0].value
    row_number = i[0].row
    for j in master_sheet.iter_rows():
        if j[0].value == Column_A:
            daily_sheet.cell(row=row_number, column=6).value = j[1].value
            #print(j[1].value)
daily_data.save('C:/Test.xlsx')
daily_data = pd.read_excel('C:/Test.xlsx')
daily_data.fillna(0, inplace=True)
It seems you've made a few fundamental mistakes in your approach. First off, "result" is an object, specifically a DataFrame that someone else created in that other post; it is not your DataFrame, so you need to call the method on your own DataFrame. Python is object oriented, meaning that objects are the key players, and .fillna() is a method that operates on your object. The usage for a toy example is as follows:
my_df = pd.read_csv(my_path_to_my_df_)
my_df.fillna(0, inplace=True)
Also, this method is for DataFrames, so you will need to convert the object that the openpyxl library creates into one. At least that is what I would assume; I haven't used this library before. Therefore, with your data you would do this:
daily_data = pd.read_excel('C:/Test.xlsx')
daily_data.fillna(0, inplace=True)
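Note that fillna only changes the DataFrame in memory, so the result still has to be written back to the workbook. As an alternative to the nested loops, the lookup itself can be done in pandas with a left merge. This is just a sketch with made-up in-memory data: the column names "A" and "Count" are assumptions standing in for the real headers.

```python
import pandas as pd

# Hypothetical stand-ins for the two workbooks.
daily = pd.DataFrame({"A": ["x", "y", "z"]})
master = pd.DataFrame({"A": ["x", "z"], "Count": [3, 7]})

# A left merge keeps every row of daily; rows without a match get NaN.
merged = daily.merge(master, on="A", how="left")

# Replace the missing counts with 0 and restore the integer type.
merged["Count"] = merged["Count"].fillna(0).astype(int)
print(merged)
```

With real files you would build the two DataFrames with pd.read_excel and finish with merged.to_excel('C:/Test.xlsx', index=False) so that the zeros actually end up in the workbook.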

Start reading from a column where data exists is in excel using pandas [duplicate]

How could I read an Excel file from column AF and onwards? I don't know the last column letter name and the file is too large to constantly keep checking.
df = pd.read_excel(r"Documents\file.xlsx", usecols="AF:")
You can't write it directly in the read_excel function, so we can only look for other options.
For the moment we could write 'AF:XFD', because 'XFD' is the last column in Excel, but it warns that this will be deprecated soon and will start raising a ParseError, so it's not recommended.
You can use other libraries to find the last column, but that isn't very fast: the Excel file is read, then we check the last column, and only after that do we create a dataframe.
If I had this problem I would do everything in Pandas by adding .iloc at the end. 'AF' is the 32nd column in Excel, which is position 31 when counting from zero as .iloc does, so:
df = pd.read_excel(r"Documents\file.xlsx").iloc[:, 31:]
It will return all columns from 'AF' till the end without having a noticeable impact on performance.
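To avoid off-by-one mistakes when turning a column letter into a position for .iloc, the conversion can be computed instead of counted by hand. This is a small sketch; column_letter_to_index is a hypothetical helper name, not part of pandas.

```python
def column_letter_to_index(letters):
    """Convert an Excel column label such as 'AF' to a zero-based position."""
    index = 0
    for ch in letters.upper():
        # Column labels are a base-26 number with digits A=1 ... Z=26.
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index - 1  # .iloc counts from 0, Excel counts from 1


start = column_letter_to_index("AF")
print(start)  # 31
```

The result can then be used as .iloc[:, start:] on the DataFrame returned by read_excel.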
You can use the xlwings library to determine the last column that contains data and then use it in your read_excel call.
import pandas as pd
import xlwings as xw
app = xw.App(visible=False)
book = app.books.open(r"Documents\file.xlsx")
sheet = book.sheets[0]
# read the last column number before closing Excel
colNum = sheet.range('AF1').current_region.last_cell.column
app.quit()
df = pd.read_excel(r"Documents\file.xlsx", usecols=list(range(31, colNum)))
where 31 is the zero-based position of column AF (a list of integers passed to usecols is zero-indexed) and colNum is the one-based number of the last column, so the range ends exactly on that column.

How can i add rows in a sheet from a dataframe (using pygsheets) without changing the number of columns in the worksheet?

I'm trying to add rows from a dataframe into Google Sheets. I'm using python2 and pygsheets. I have 10 columns in the Google Sheet and 6 columns in my dataframe, and my problem is that when I add the rows to the sheet, it deletes the 4 extra columns of my sheet.
This code should add as many rows as the df has to the worksheet (rows without any content):
import pygsheets
import pandas as pd
sh = gc.open_by_key('xxxx')
worksheet = sh.worksheet('RawData')
rows= df.shape[0]
worksheet.add_rows(rows)
The code does work, but it fits the sheet's grid to that of the df.
Does anyone know a solution for adding the exact number of rows to a worksheet while keeping the worksheet columns intact?
I believe your goal is as follows.
In your situation, there are 10 columns in Google Spreadsheet.
For this Spreadsheet, you want to append the values from the dataframe which have the 6 columns.
In this situation, you don't want to remove other 4 columns in Spreadsheet.
You want to achieve this using pygsheets of python.
In this case, how about the following flow?
Convert dataframe to a list.
Put the list to Spreadsheet.
When this flow is reflected in a script, it becomes as follows.
Sample script:
df = ### <--- Please set your dataframe.
sh = gc.open_by_key('xxxx')
worksheet = sh.worksheet_by_title('RawData')
values = df.values.tolist()
worksheet.append_table(values, start='A1', end=None, dimension='ROWS', overwrite=False)
If you want to include the header row, please use the following script.
df = ### <--- Please set your dataframe.
sh = gc.open_by_key('xxxx')
worksheet = sh.worksheet_by_title('RawData')
values = [df.columns.values.tolist()]
values.extend(df.values.tolist())
worksheet.append_table(values, start='A1', end=None, dimension='ROWS', overwrite=False)
In the above script, the values are put from the first empty row of the sheet RawData.
When overwrite=False is used, new rows are inserted for the values, as many as there are rows in values.
Reference:
append_table
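To make the shape of values concrete, here is a small sketch of the conversion step on its own, without any Google Sheets call. The sample data and the column names "name" and "qty" are made up. Replacing missing values with empty strings is an extra step I am assuming is wanted, since NaN is not valid JSON and may be rejected when the values are sent.

```python
import pandas as pd

# Hypothetical sample data; "name" and "qty" are made-up column names.
df = pd.DataFrame({"name": ["a", "b"], "qty": [1.0, None]})

# Replace missing values with empty strings so every cell is a plain value.
clean = df.astype(object).where(df.notna(), "")

# One inner list per sheet row: header first, then the data rows.
header = [clean.columns.tolist()]
values = header + clean.values.tolist()
print(values)
```

The resulting list of lists is what would be passed as the first argument of append_table in the scripts above.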

Python and Excel - OpenPyXL

I am working with Excel using Python and have couple of questions:
Loading Excel Sheet into 2d Array.
In VBA I'd simply do:
dim arrData as Variant
arrData = shtData.Range("A1:E2500")
I would get array(1 to 2500, 1 to 5), which I can easily access, for example arrData(1,5) -> row 1, column 5.
In Python what i managed to do is:
#declare list
excel_data=[]
#loop to load excel spreadsheet data into 2d Array
#basically I am looping through every row and append it to list
for row in shtData.iter_rows(min_row=5, max_row=50,values_only =True):
    excel_data.append(row)
Is there a way to assign row to list, starting from index 1 not 0?
In VBA there is an option Option Base 1.
https://learn.microsoft.com/en-us/office/vba/language/reference/user-interface-help/option-base-statement
Is it the fastest way to operate on Excel Dataset?
I am planning then to loop through let's say 2500 rows and 5 columns -> 12'500 cells.
With VBA it was very efficient to be honest (operating on array in memory).
As I understand it, the functions of OpenPyXL:
load_workbook
#only creates REFERENCE TO EXCEL WORKBOOK - it does not open it? or is it "loaded" into memory but what is on the HD is actually intact?
shtData = wkb.worksheets[0]
#again only reference?
shtReport = wkb.create_sheet(title="ReportTable")
#it adds sheet but it adds it in excel which is loaded into memory, only after saving, Excel on HD is actually overwritten?
You can use Pandas and create a dataframe (2D table) from the Excel spreadsheet.
import pandas as pd
df = pd.read_excel("data.xls")
print(df)
print("____________")
print(f'Average sales are: {df["Gross"].values.mean()}')
print(f'Net income for April: {df.at[3, "Net"]}')
print("____________")
df_no_header = pd.read_excel("data.xls",skiprows=1, header=None)
print(df_no_header)
print("____________")
print(f'Net income for April: {df_no_header.at[3, 2]}')
The Pandas dataframe has many methods that allow you to access rows and columns and do much more. Setting skiprows=1, header=None skips the header row, so the columns are labelled with integers starting at 0.
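On the indexing question itself: Python lists always start at index 0 and there is no equivalent of Option Base 1, but a thin wrapper can give VBA-style access. The sketch below is only an illustration; OneBasedGrid is a hypothetical class, and the sample rows stand in for what shtData.iter_rows(values_only=True) would yield.

```python
class OneBasedGrid:
    """Read-only 2D grid addressed as grid(row, col) with 1-based indices."""

    def __init__(self, rows):
        # Materialise the rows once so lookups work on data held in memory.
        self._rows = [tuple(row) for row in rows]

    def __call__(self, row, col):
        if row < 1 or col < 1:
            raise IndexError("row and col start at 1")
        return self._rows[row - 1][col - 1]


# Stand-in for shtData.iter_rows(values_only=True).
sample_rows = [(1, "a"), (2, "b")]
grid = OneBasedGrid(sample_rows)
print(grid(1, 2))  # a
```

As for the other question, load_workbook reads the file into memory, and the file on disk only changes when you call save() on the workbook.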

Create excel worksheet for every unique value in column of dataframe python

I have a VERY large CSV file with 250,000+ records that takes a while to do any analyses on in Excel, so I wanted to splice it into multiple worksheets based on a specific calculated column that I created in pandas.
The specific column is called "Period" and is a string variable in my dataframe in the form of MMM_YYYY (e.g., Jan_2016, Feb_2016, etc.)
I am trying to make something that would have a workbook (let's call it data_by_month.xlsx) have a worksheet for every unique period in the dataframe column "Period," with all matching rows written into the respective worksheet.
This is the logic that I tried:
for row in df:
    for period in unique_periods:
        if row[38] == period:
            with pd.ExcelWriter("data_by_month.xlsx") as writer:
                df.to_excel(writer, sheet_name = period)
The idea behind this is for every row in the dataframe, go through every period in a list of unique periods, and if the row[38] -- which is the index of Period -- is equal to a period, write it into the data_by_month.xlsx workbook into a specific worksheet.
I know that my code is completely incorrect right now, but it's the general logic that I've been trying to implement. I'm pretty sure I'm referring to the location of the "Period" column in the dataframe incorrectly, since it keeps saying it's out of range. Any advice would be welcome!
Thank you so much!
You should be able to achieve this using a groupby in pandas. For example ...
with pd.ExcelWriter("data_by_month.xlsx") as writer:
    for period, data in df.groupby('Period'):
        data.to_excel(writer, sheet_name = period)
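To see what the groupby loop hands to each worksheet, here is a small sketch that runs on made-up in-memory data and collects the groups in a dict instead of writing a file. The "Amount" column and the sample values are assumptions. By default groupby sorts the keys, which for MMM_YYYY strings means alphabetical order, so sort=False is used here to keep the order in which the periods first appear.

```python
import pandas as pd

# Hypothetical sample data with the "Period" column from the question.
df = pd.DataFrame({
    "Period": ["Jan_2016", "Feb_2016", "Jan_2016"],
    "Amount": [10, 20, 30],
})

# Each group is a DataFrame holding only the rows of one period.
sheets = {period: data for period, data in df.groupby("Period", sort=False)}

for period, data in sheets.items():
    print(period, len(data))
```

Each key of sheets corresponds to one sheet_name and each value to the rows that to_excel would write into that sheet.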
