Python and Excel - OpenPyXL

I am working with Excel using Python and have a couple of questions:
Loading Excel Sheet into 2d Array.
In VBA I'd simply do:
Dim arrData As Variant
arrData = shtData.Range("A1:E2500")
I would get an array(1 To 2500, 1 To 5), which I can easily access, for example arrData(1, 5) -> row 1, column 5.
In Python, what I managed to do is:
# declare the list
excel_data = []

# load the spreadsheet data into a 2D array:
# loop through every row and append it to the list
for row in shtData.iter_rows(min_row=5, max_row=50, values_only=True):
    excel_data.append(row)
Is there a way to load each row into the list starting from index 1 instead of 0?
In VBA there is Option Base 1 for this:
https://learn.microsoft.com/en-us/office/vba/language/reference/user-interface-help/option-base-statement
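There is no direct equivalent of Option Base 1 in Python; lists are always 0-based. A minimal sketch of the usual workaround, padding index 0 so the real data starts at 1 (reusing the loop above):

# Emulate Option Base 1 by padding index 0 - a convention, not a language feature
excel_data = [None]  # dummy entry so the first real row sits at index 1
for row in shtData.iter_rows(min_row=5, max_row=50, values_only=True):
    excel_data.append((None,) + row)  # pad each row tuple the same way
print(excel_data[1][5])  # first loaded row, 5th column, VBA-style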
Is this the fastest way to operate on an Excel dataset?
I am planning to loop through, say, 2500 rows and 5 columns -> 12,500 cells.
With VBA this was very efficient, to be honest (operating on an array in memory).
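On the speed question, for what it's worth: for pure reads, openpyxl's fast path is usually read_only=True (which streams the file instead of building the full object model) combined with values_only=True. A sketch, with a hypothetical file name:

import openpyxl

# read_only streams the XML; values_only yields plain tuples instead of Cell objects
wb = openpyxl.load_workbook("data.xlsx", read_only=True)
ws = wb.worksheets[0]
excel_data = [row for row in ws.iter_rows(min_row=1, max_row=2500, max_col=5, values_only=True)]
wb.close()  # read-only workbooks hold the file handle open until closed

Once loaded, the 12,500 cells are ordinary Python objects in memory, much like the VBA array.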
As I understand it, the functions of OpenPyXL behave as follows:
load_workbook
# only creates a REFERENCE to the Excel workbook - it does not open it? Or is it "loaded" into memory while the file on the HD stays intact?
shtData = wkb.worksheets[0]
# again only a reference?
shtReport = wkb.create_sheet(title="ReportTable")
# this adds a sheet, but only to the workbook loaded into memory; only after saving is the Excel file on the HD actually overwritten?
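For illustration (file name hypothetical): load_workbook parses the whole file into memory unless read_only=True is passed, and the file on disk is untouched until save() is called:

import openpyxl

wkb = openpyxl.load_workbook("data.xlsx")  # parsed into memory; the file on the HD is untouched
shtData = wkb.worksheets[0]                # a reference to the in-memory sheet
shtReport = wkb.create_sheet(title="ReportTable")  # exists only in memory so far
shtReport["A1"] = "report header"
wkb.save("data.xlsx")  # only now is the file on the HD actually overwritten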

You can use Pandas and create a dataframe (a 2D table) from the Excel spreadsheet.
import pandas as pd
df = pd.read_excel("data.xls")
print(df)
print("____________")
print(f'Average sales are: {df["Gross"].values.mean()}')
print(f'Net income for April: {df.at[3, "Net"]}')
print("____________")
df_no_header = pd.read_excel("data.xls", skiprows=1, header=None)
print(df_no_header)
print("____________")
print(f'Net income for April: {df_no_header.at[3, 2]}')
The Pandas dataframe has many methods that let you access rows and columns and do much more. Setting skiprows=1, header=None skips the header row, so access is purely positional.
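For example, positional access with .iloc mirrors the VBA array from the question, just 0-based (continuing with the df loaded above):

print(df.iloc[0, 4])   # row 1, column 5 in VBA terms: arrData(1, 5)
rows, cols = df.shape  # (row count, column count), handy for bounded loops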

Related

Write dataframes to multiple Excel sheets without overwriting data using python/pandas

I'm puzzled by an error while saving pandas/Excel data. My code reads data from an Excel file with multiple sheets into dataframes in a for loop. Some data subsetting is carried out on each dataframe during each iteration, and the results (2 subset dataframes) are appended to the bottom of the original data in the corresponding sheets (at the same row position but in different columns).
My code:
import os
import pandas as pd
import re
from excel import append_df_to_excel

path = 'orig_vers.xlsx'
xls = pd.ExcelFile(path, engine='openpyxl')

# Loop over the sheets
for sheet in xls.sheet_names:
    try:
        df = pd.read_excel(xls, sheet_name=sheet)
        # drop empty columns and drop rows with missing values
        df1 = df.dropna(how='all', axis=1).dropna(how='any', axis=0)
        # Subset the data
        ver1 = ...  # some calculation on df1
        ver2 = ...  # some calculation on df1
        lastrow = len(df1)  # the last row of the remaining data
        append_df_to_excel(path, ver1, sheet_name=sheet, startrow=lastrow + 2, index=False)
        append_df_to_excel(path, ver2, sheet_name=sheet, startrow=lastrow + 2, startcol=5, index=False)
    except Exception:
        continue
"append_df_to_excel" is a helper function from this link.
The code works well for the first sheet, appending the two result dataframes at the bottom of the original data at the specified positions, but no data is appended to the other sheets. If I remove the try/except lines and run the code, I get this error: "Error -3 while decompressing data: invalid distance too far back".
My suspicion is that, because from sheet number 2 onward the original data has some empty rows which my code removes before subsetting, the Excel writer has issues with this line: lastrow = len(df1). Does anyone know the answer to this issue?
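A hedged guess rather than a confirmed diagnosis: pd.ExcelFile keeps orig_vers.xlsx open while append_df_to_excel rewrites that same file inside the loop, so from the second sheet onward the reads go through a handle to a zip archive that has already been replaced, and zlib errors like "invalid distance too far back" are a typical symptom. A sketch that reads everything up front before writing anything (the ver1/ver2 lines are placeholders, as in the question):

import pandas as pd
from excel import append_df_to_excel  # same helper as above

path = 'orig_vers.xlsx'
# Read every sheet into memory first and close the reader, so later writes
# never race against an open handle on the same file.
with pd.ExcelFile(path, engine='openpyxl') as xls:
    frames = {sheet: pd.read_excel(xls, sheet_name=sheet) for sheet in xls.sheet_names}

for sheet, df in frames.items():
    df1 = df.dropna(how='all', axis=1).dropna(how='any', axis=0)
    ver1 = df1.head(5)  # placeholder for "some calculation on df1"
    ver2 = df1.tail(5)  # placeholder for "some calculation on df1"
    lastrow = len(df1)
    append_df_to_excel(path, ver1, sheet_name=sheet, startrow=lastrow + 2, index=False)
    append_df_to_excel(path, ver2, sheet_name=sheet, startrow=lastrow + 2, startcol=5, index=False)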

How to add the extra row(s) present in Excel Dataframe2 back into Dataframe1 and vice-versa?

I have an Excel file which contains two sheets, an old sheet and a new sheet. The task is to generate the difference between the two sheets and print the difference in a third sheet.
(Screenshots of the Old Sheet and New Sheet omitted.)
Note: each row contains one file's data, and the data is of dictionary type.
I am comparing the sheets with this code
import numpy as np
import pandas as pd

writer2 = pd.ExcelWriter(fvc, engine='openpyxl', mode='a')  # taking the value of fvc from the GUI
olddf = pd.read_excel(fvc, sheet_name=0, na_filter=False)
newdf = pd.read_excel(fvc, sheet_name=1, na_filter=False)
comparison_values = olddf.values == newdf.values
rows, cols = np.where(comparison_values == False)
for item in zip(rows, cols):
    olddf.iloc[item[0], item[1]] = '{} --> {}'.format(olddf.iloc[item[0], item[1]], newdf.iloc[item[0], item[1]])
olddf.to_excel(writer2, sheet_name='Delta Sheet', index=False, header=True)
writer2.save()
Now this works the way I wanted when the two dataframes have the same number of rows, but if one dataframe has extra rows they are not printed, and I get ValueError: not enough values to unpack (expected 2, got 1).
So is it possible, if one dataframe has some extra rows, to append those rows to the other dataframe, so that both dataframes become equal in size and the comparison is possible?
For example, in the old sheet there are 15 rows of data and in the new sheet 17 rows, meaning new files have been added in the new sheet, and that's why it can't be compared. Can I print those rows in the final output?
I've tried merging, isin, difference, and other solutions from Stack Overflow, but nothing seems to work for me.
If more clarification about the question is needed, please do ask.
Thanks for your help :)
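One possible approach, sketched under the assumption that both sheets load with the default integer index: pad the shorter dataframe to the length of the longer one before comparing, so rows that exist in only one sheet show up as differences instead of breaking the comparison.

import numpy as np
import pandas as pd

# Pad the shorter frame with blank rows so the shapes match
n = max(len(olddf), len(newdf))
olddf = olddf.reindex(range(n)).fillna('')
newdf = newdf.reindex(range(n)).fillna('')

comparison_values = olddf.values == newdf.values
rows, cols = np.where(~comparison_values)
for r, c in zip(rows, cols):
    olddf.iloc[r, c] = '{} --> {}'.format(olddf.iloc[r, c], newdf.iloc[r, c])

A row present only in the new sheet then renders as ' --> value' in the delta sheet, which would also print the extra rows in the final output.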

How can I add rows to a sheet from a dataframe (using pygsheets) without changing the number of columns in the worksheet?

I'm trying to add rows from a dataframe into Google Sheets. I'm using Python 2 and pygsheets. I have 10 columns in the Google Sheet and 6 columns in my dataframe, and my problem is that when I add the rows to the sheet, it deletes the 4 extra columns.
So this code should add the number of rows of the df to the worksheet (rows without any content):
import pygsheets
import pandas as pd

gc = pygsheets.authorize()  # client authorization (implicit in the original snippet)
sh = gc.open_by_key('xxxx')
worksheet = sh.worksheet('RawData')
rows = df.shape[0]  # number of rows in the dataframe
worksheet.add_rows(df)
The code does work, but it fits the grid of the sheet to that of the df.
Does anyone know a solution for adding the exact number of rows to a worksheet while keeping the worksheet's columns intact?
I believe your goal is as follows.
In your situation, there are 10 columns in the Google Spreadsheet.
You want to append the values from the dataframe, which has 6 columns, to this Spreadsheet.
In doing so, you don't want to remove the other 4 columns in the Spreadsheet.
You want to achieve this using pygsheets in Python.
In this case, how about the following flow?
1. Convert the dataframe to a list.
2. Put the list into the Spreadsheet.
When this flow is reflected in a script, it becomes as follows.
Sample script:
df = ### <--- Please set your dataframe.
sh = gc.open_by_key('xxxx')
worksheet = sh.worksheet('RawData')
values = df.values.tolist()
worksheet.append_table(values, start='A1', end=None, dimension='ROWS', overwrite=False)
If you want to include the header row, please use the following script.
df = ### <--- Please set your dataframe.
sh = gc.open_by_key('xxxx')
worksheet = sh.worksheet('RawData')
values = [df.columns.values.tolist()]
values.extend(df.values.tolist())
worksheet.append_table(values, start='A1', end=None, dimension='ROWS', overwrite=False)
In the above script, the values are put into the sheet RawData starting from the first empty row.
And when overwrite=False is used, as many new rows as there are rows in values are added.
Reference:
append_table

Pandas Groupby - enumerate through Dataframe and copy into new, unique Excel Worksheets

So, the data in the linked picture below is on one sheet in an Excel workbook that was created by appending a bunch of Excel files together using Pandas (I've added the first column "Row#" for illustrative purposes only).
(Picture of my dataset omitted.)
What I’m trying to do is enumerate through the unique combinations of “Year” and “Scenario” and copy that data into a new workbook. Also, in that new workbook, I want a unique worksheet made for each unique combination along with all its data.
For example, a new Excel workbook would be created, the first tab in that workbook would be titled "2020 Actuals", and that worksheet would contain ONLY the first row in the picture above (where Year = 2020 and Scenario = Actuals). It would also include all the headers from the screenshot above in each new worksheet. The next worksheet in that same workbook would be titled "2020 Plan" and would contain rows 2-5. The third worksheet would be titled "2020 Fcsst" and only include rows 6 and 7 (and include the headers). And so on and so on.
So essentially I’m trying to create unique worksheets for each specific combination/concatenation of Columns Year and Scenario. I’m not trying to pivot or sum or aggregate the values in the “Jan” or “Feb” columns in any way. Just trying to slice each unique Year-Scenario combo into a new Excel worksheet. I know this can be done with a "for loop" and a pandas groupby but can't quite get it.
This is as far as I got, but I get this error --> TypeError: expected string or bytes-like object
writer = pd.ExcelWriter('test2.xlsx')
grouped = combined.groupby(['Year', 'Scenario'])
for name, group in grouped:
    group.to_excel(writer, sheet_name=name)
writer.save
You don't need a groupby to do this; you just need to filter the dataset. (If you do need a groupby to get aggregations, do that first and then start with the below.)
You'll also need to install xlsxwriter: pip install xlsxwriter
The general idea is to find the unique groupings, then iterate through them, filtering the dataset down and writing each subset to its own sheet.
import pandas as pd
import random

# Create a randomized dataframe
df = pd.DataFrame({'Year': [random.choice(['2010', '2011', '2012']) for _ in range(100)],
                   'Scenario': [random.choice(['Plan', 'Actuals', 'Fcsst']) for _ in range(100)],
                   'Val': list(range(0, 100))})

# You can sort values here if you want, but you don't have to
df = df.sort_values(['Year', 'Scenario'])
df.head()
Year Scenario Val
5 2010 Actuals 5
14 2010 Actuals 14
31 2010 Actuals 31
64 2010 Actuals 64
69 2010 Actuals 69
# Define your list of unique concatenations of Year and Scenario.
unique_ys = df[['Year', 'Scenario']].drop_duplicates().values.tolist()
# or
unique_ys = list(df.groupby(['Year', 'Scenario']).groups)
unique_ys
[('2010', 'Actuals'),
('2010', 'Fcsst'),
('2010', 'Plan'),
('2011', 'Actuals'),
('2011', 'Fcsst'),
('2011', 'Plan'),
('2012', 'Actuals'),
('2012', 'Fcsst'),
('2012', 'Plan')]
# Initialize a writer object and choose the filename
writer = pd.ExcelWriter('finance_file.xlsx', engine='xlsxwriter')

# Iterate through the unique concatenations, filter the dataset and write each subset to its own sheet
for list_ in unique_ys:
    df[(df.Year == list_[0]) & (df.Scenario == list_[1])].to_excel(writer,
                                                                   sheet_name=list_[0] + ' ' + list_[1],
                                                                   index=False)

# Close the Pandas Excel writer and output the Excel file.
writer.save()
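A side note for newer pandas versions (not part of the original answer): ExcelWriter.save() was removed in pandas 2.0, so on current versions use the writer as a context manager (or call writer.close()):

# Same loop; the with-block closes the writer on exit, which also works
# on pandas >= 2.0 where writer.save() no longer exists
with pd.ExcelWriter('finance_file.xlsx', engine='xlsxwriter') as writer:
    for year, scenario in unique_ys:
        df[(df.Year == year) & (df.Scenario == scenario)].to_excel(
            writer, sheet_name=year + ' ' + scenario, index=False)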

Spurious 'None' cells loaded at the beginning of columns by openpyxl

I've been working on a function in Python, using the openpyxl library, that loads columns from a specified sheet in a workbook and does some data conditioning before returning the columns as lists or numpy arrays.
To load the columns, I load the workbook, get the target sheet, store the columns, then iterate through each column and append the cell contents to lists:
import openpyxl

# open the excel file
wb = openpyxl.load_workbook(fname, read_only=True)
print('\nWorkbook "%s" open...' % (fname))

# get the target sheet
sh = wb.get_sheet_by_name(sheet)
print('Sheet "%s" acquired...' % (sheet))

# store only the desired columns of the sheet
# (L is the number of desired columns and cols their 1-based indices,
# both defined earlier in the function)
sheetcols = sh.columns
columns = [[] for i in range(L)]
for i in range(L):
    columns[i] = sheetcols[cols[i] - 1]

# read the selected columns into a list of lists
print('Parsing desired columns of data...')
data = [[] for i in range(L)]
# iterate over the columns
for i in range(L):
    # iterate over a specific column
    print(len(columns[i]))
    for j in range(len(columns[i])):
        # store cell contents as a string (for now)
        data[i].append(columns[i][j].value)
Some columns load with several None elements at the beginning of their respective lists that do not correspond to the data in the Excel file. For example, a column with two empty cells at the beginning (left empty because of header space or whatever) is expected to load with two None elements at the beginning of its list, but it might load with five or six None elements instead of just two.
It's consistent every time I run the function: the same columns have this problem every time, which makes me think there is hidden data of some kind in the Excel sheet. I've tried clearing the contents of the cells that are supposed to be empty, but no luck.
Does anybody more familiar with the openpyxl module, or maybe just with Excel, have thoughts about why these mysterious extra None elements get into the imported data?
The code is incomplete, but it's probably worth noting that the behaviour for worksheets with missing cells is necessarily somewhat unpredictable. For example, if a worksheet only has values in the cells D3:G8, what should its columns be? openpyxl will create cells on demand for any given range, and I suspect that is what you may be seeing.
ws.rows and ws.columns are provided for convenience, but you are almost always better off working with ws.get_squared_range(…), which should give you fewer surprises.
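For reference, get_squared_range comes from older openpyxl releases; in current versions the equivalent bounded read is iter_rows (or iter_cols) with explicit limits. Asking for an explicit rectangle means openpyxl never has to guess the sheet's extent, which is where the surprise None cells tend to appear. A sketch with a hypothetical file and bounds:

import openpyxl

wb = openpyxl.load_workbook("data.xlsx", read_only=True)
ws = wb["Sheet1"]
# An explicit rectangle (here D3:G8) - no padding cells get invented
for row in ws.iter_rows(min_row=3, max_row=8, min_col=4, max_col=7, values_only=True):
    print(row)
wb.close()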
