How to create an Excel spreadsheet of indeterminate length with a DataFrame? - python

I am fairly new to Python and I have been writing a program where I need to create an Excel spreadsheet with an indeterminate number of columns. My previous code, which creates exactly 4 columns, was:
writer = pd.ExcelWriter(datapath + 'Test#' + str(testcount) + '.xlsx', engine = 'xlsxwriter')
df1 = pd.DataFrame({'Species 1' : evolution[0]})
df2 = pd.DataFrame({'Species 2' : evolution[1]})
df3 = pd.DataFrame({'Species 3' : evolution[2]})
df4 = pd.DataFrame({'Species 4' : evolution[3]})
df1.to_excel(writer, sheet_name='Raw Data')
df2.to_excel(writer, sheet_name='Raw Data', startcol=2, index=False)
df3.to_excel(writer, sheet_name='Raw Data', startcol=3, index=False)
df4.to_excel(writer, sheet_name='Raw Data', startcol=4, index=False)
writer.save()
(Evolution is a separate function from which I draw the data to populate the columns.) So the above code worked exactly as needed. My attempt at creating a way to have an indeterminate number of columns was this:
writer = pd.ExcelWriter(datapath + 'Test#' + str(testcount) + '.xlsx', engine = 'xlsxwriter')
def do(x):
    and1 = x+1
    "df" + str(and1) = pd.DataFrame({"Species " + str(i) : evolution[x]})
def do2(x):
    and1 = x+1
    "df" + str(and1).to_excel(writer, sheet_name='Raw Data', startcol=and1, index=False)
def repeat(times, f):
    for i in range(times): f()
repeat(colnumb, do)
repeat(colnumb, do2)
writer.save()
("colnumb" is a predefined variable.) However, this keeps on outputting the following error:
"df" + str(and1) = pd.DataFrame({"Species " + str(i) : evolution[x]})
^
SyntaxError: can't assign to operator
If someone could help me see what is wrong with my attempted solutions or clarify a better way to accomplish my goal I would be very grateful. (Also sorry if I messed up any formatting. This is my first time posting a question here so if I did mess up some convention please let me know.)

You can create a single dataframe with all the columns and export it:
data = {'Species 1' : evolution[0], 'Species 2' : evolution[1], 'Species 3' : evolution[2], 'Species 4' : evolution[3]}
writer = pd.ExcelWriter(datapath + 'Test#' + str(testcount) + '.xlsx', engine = 'xlsxwriter')
df = pd.DataFrame(data)
df.to_excel(writer, sheet_name='Raw Data')
writer.save()
Is this what you were asking?

I think you have over-engineered this. The syntax error occurs because you are trying to create dynamic variable names to store each dataframe, but you cannot assign a value to an expression (i.e. "df" + str(and1) is a string expression, not a variable name, so it cannot hold the value returned by pd.DataFrame({"Species " + str(i) : evolution[x]})). Thankfully, this operation is unnecessary, so let's look at the rest of the code.
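To illustrate the point about dynamic variable names, here is a minimal sketch of the usual alternative: keep the objects in a dictionary keyed by the name you would otherwise have invented. The evolution list below is made-up sample data, not the asker's.

```python
# Made-up stand-in for the asker's data: one list of values per species.
evolution = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Instead of trying to create variables df1, df2, ... at runtime,
# store each column under a string key in a dict.
columns = {}
for x, values in enumerate(evolution):
    columns["Species " + str(x + 1)] = values

print(list(columns))
```

The keys can be built from any expression, which is exactly what an assignment target cannot be.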
First, you can combine all your functions into a single iteration so that you only iterate over your columns once, and then perform all the necessary operations for that column. This is best performed by a dict comprehension. Second, it looks like you are currently using DataFrames to hold Series objects. You can actually create your entire dataframe first before exporting it, in full, to Excel.
You call the colnumb variable in your repeat() calls, but I am not sure where it comes from, since you have not created any dataframes with columns yet. Is it the length of evolution? Also, you call evolution a function, but interact with it like a list that is storing results of a previously run function. I'll provide solutions for both scenarios. A much simpler and more efficient version would look like:
If colnumb is some arbitrary number you input:
# Create a dictionary with all columns of dynamic Species names storing the relevant values of evolution.
# evolution[i] is used as the column data directly, as in the original df1..df4 code.
species_dict = {'Species {}'.format(i + 1): evolution[i] for i in range(colnumb)}
# Turn that dict into a dataframe
df = pd.DataFrame(species_dict)
# Output the dataframe to Excel
df.to_excel(datapath + 'Test#' + str(testcount) + '.xlsx', sheet_name='Raw Data')
And if colnumb is really just the length of evolution:
# Create dict by enumerating all values of evolution to access both index and value
species_dict = {'Species {}'.format(i): value for i, value in enumerate(evolution, start=1)}
# Convert dict to dataframe
df = pd.DataFrame(species_dict)
# Output the dataframe to Excel
df.to_excel(datapath + 'Test#' + str(testcount) + '.xlsx', sheet_name='Raw Data')
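One caveat with building a single dataframe from a dict of lists: pd.DataFrame raises a ValueError when the lists have different lengths. If that can happen with your data, wrapping each list in a pd.Series lets pandas pad the shorter columns with NaN. A minimal sketch, assuming evolution is a list of lists; the sample data and the helper name build_species_frame are made up for illustration.

```python
import pandas as pd


def build_species_frame(evolution):
    """Return one DataFrame with a 'Species N' column per entry of evolution.

    Each entry is wrapped in a Series so columns of unequal length are
    aligned on the row index and padded with NaN.
    """
    return pd.DataFrame(
        {
            "Species {}".format(i): pd.Series(values)
            for i, values in enumerate(evolution, start=1)
        }
    )


# Made-up sample data with columns of different lengths.
evolution = [[1, 2, 3], [4, 5], [6]]
df = build_species_frame(evolution)
print(df)
```

The resulting frame can then be exported in one call, for example df.to_excel(path, sheet_name='Raw Data', index=False), where path is whatever file name you build.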

Related

Python Pandas ExcelWriter append to sheet creates a new sheet

I would really appreciate some help.
I'm trying to use a loop to create sheets, and add data to those sheets on every iteration. The position of my data is correct, however pandas ExcelWriter creates a new sheet instead of appending to the one created the first time the loop runs.
I'm a beginner, and right now function comes before form, so forgive me.
My code:
import pandas as pd
# initial files for dataframes
excel_file = 'output.xlsx'
setup_file = 'setup.xlsx'
# write to excel
output_filename = 'output_final.xlsx'
df = pd.read_excel(excel_file) # create dataframe of entire sheet
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_').str.replace('(', '').str.replace(')',
'') # clean dataframe titles
df_setup = pd.read_excel(setup_file)
df_setup.columns = df_setup.columns.str.strip().str.lower().str.replace(' ', '_').str.replace('(', '').str.replace(')',
'') # clean dataframe titles
df_2 = pd.merge(df, df_setup) # Merge data with setup to have krymp size for each wire in dataframe
df_2['wirelabel'] = "'" + df_2['cable'] + "_" + df_2['function_code'] + "-" + df_2['terminal_strip'] + ":" + df_2[
'terminal'] # creates column for the wirelabel by appending columns with set delimiters. #TODO: delimiters to be by inputs.
df_2.sort_values(by=['switchboard']) # sort so we get proper order
switchboard_unique = df.switchboard.unique().tolist() # crate variable containing unique switchboards for printing to excel sheets
def createsheets(output_filename, sheetname, row_start, column_start, df_towrite):
    with pd.ExcelWriter(output_filename, engine='openpyxl', mode='a') as writer:
        df_towrite.to_excel(writer, sheet_name=sheetname, columns=['wirelabel'], startrow=row_start, startcol=column_start, index=False, header=False)
        writer.save()
        writer.close()
def sorter():
    for s in switchboard_unique:
        df_3 = df_2.loc[df_2['switchboard'] == s]
        krymp_unique = df_3.krymp.unique().tolist()
        krymp_unique.sort()
        # print(krymp_unique)
        column_start = 0
        row_start = 0
        for k in krymp_unique:
            df_3.loc[df_3['krymp'] == k]
            # print(k)
            # print(s)
            # print(df_3['wirelabel'])
            createsheets(output_filename, s, row_start, column_start, df_3)
            column_start = column_start + 1
sorter()
current behavior:
If the sheet name is "sheet", my script creates sheet1, sheet2, sheet3, etc.
pictureofcurrent
Wanted behavior
Create a sheet for each item in "df_3", and put data into columns according to the position calculated in column_start. The position in my code works, just goes to the wrong sheet.
pictureofwanted
I hope it's clear what I'm trying to accomplish, and all help is appreciated.
I tried all the example code I have found regarding writing to Excel.
I know my code is not a work of art, but I will update this post with the answer to my own question for the sake of completeness, in case anyone stumbles on this post.
It turns out I misunderstood what mode='a' does in pandas' pd.ExcelWriter. It appends sheets to an existing workbook, not data to an existing sheet: by default, writing to a sheet name that already exists does not add to that sheet, even though mode is set to 'a'.
Realizing this, I changed my code to build a dataframe for the entire sheet (df_sheet), and then call the "createsheets" function once per sheet. The first version wrote my data column by column.
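The idea of building the whole sheet in memory first can be sketched on its own, without any Excel file involved. This is a simplified illustration with made-up data, not the code from the post: each group becomes one column, the row index is reset so all columns start at row 0, and pd.concat(axis=1) places them side by side.

```python
import pandas as pd

# Made-up sample data: each wire label belongs to one krymp size.
data = pd.DataFrame(
    {
        "krymp": [2, 1, 2, 1, 1],
        "wirelabel": ["W1", "W2", "W3", "W4", "W5"],
    }
)

columns = []
for k, group in data.groupby("krymp"):
    # Reset the index so every column starts at row 0, then name the
    # column after the group it holds.
    col = group["wirelabel"].reset_index(drop=True).rename("krymp " + str(k))
    columns.append(col)

# Columns of different lengths are padded with NaN.
sheet = pd.concat(columns, axis=1)
print(sheet)
```

The finished frame can then be written with a single to_excel call per sheet, which avoids having to append to an existing sheet at all.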
"Final" code:
import pandas as pd
# initial files for dataframes
excel_file = 'output.xlsx'
setup_file = 'setup.xlsx'
# write to excel
output_filename = 'output_final.xlsx'
column_name = 0
df = pd.read_excel(excel_file) # create dataframe of entire sheet
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_').str.replace('(', '').str.replace(')',
'') # clean dataframe titles
df_setup = pd.read_excel(setup_file)
df_setup.columns = df_setup.columns.str.strip().str.lower().str.replace(' ', '_').str.replace('(', '').str.replace(')',
'') # clean dataframe titles
df_2 = pd.merge(df, df_setup) # Merge data with setup to have krymp size for each wire in dataframe
df_2['wirelabel'] = "'" + df_2['cable'] + "_" + df_2['function_code'] + "-" + df_2['terminal_strip'] + ":" + df_2[
'terminal'] # creates column for the wirelabel by appending columns with set delimiters. #TODO: delimiters to be by inputs.
df_2.sort_values(by=['switchboard']) # sort so we get proper order
switchboard_unique = df.switchboard.unique().tolist() # crate variable containing unique switchboards for printing to excel sheets
def createsheets(output_filename, sheetname, df_towrite):
    with pd.ExcelWriter(output_filename, engine='openpyxl', mode='a') as writer:
        df_towrite.to_excel(writer, sheet_name=sheetname, index=False, header=True)
def to_csv_file(output_filename, df_towrite):
    df_towrite.to_csv(output_filename, mode='w', index=False)
def sorter():
    for s in switchboard_unique:
        df_3 = df_2.loc[df_2['switchboard'] == s]
        krymp_unique = df_3.krymp.unique().tolist()
        krymp_unique.sort()
        column_start = 0
        row_start = 0
        df_sheet = pd.DataFrame([])
        for k in krymp_unique:
            df_5 = df_3.loc[df_3['krymp'] == k]
            df_4 = df_5.filter(['wirelabel'])
            column_name = "krymp " + str(k) + " Tavle: " + str(s)
            df_4 = df_4.rename(columns={"wirelabel": column_name})
            df_4 = df_4.reset_index(drop=True)
            df_sheet = pd.concat([df_sheet, df_4], axis=1)
            column_start = column_start + 1
            row_start = row_start + len(df_5.index) + 1
        createsheets(output_filename, s, df_sheet)
        to_csv_file(s + ".csv", df_sheet)
sorter()
Thank you.

Saving multiple dataframes to multiple excel sheets multiple times?

I have a function to save multiple dataframes as multiple tables on a single sheet of an Excel workbook:
def multiple_dfs(df_list, sheets, file_name, spaces):
    writer = pd.ExcelWriter(file_name,engine='xlsxwriter')
    row = 0
    for dataframe in df_list:
        dataframe.to_excel(writer,sheet_name=sheets,startrow=row , startcol=0)
        row = row + len(dataframe.index) + spaces + 1
    writer.save()
If I call this function multiple times to write multiple tables to multiple sheets, I end up with just one workbook and one sheet, the one that was called last:
multiple_dfs(dfs_gfk, 'GFK', 'file_of_tables.xlsx', 1)
multiple_dfs(dfs_top, 'TOP', 'file_of_tables.xlsx', 1)
multiple_dfs(dfs_all, 'Total', 'file_of_tables.xlsx', 1)
So in the end I only have file_of_tables workbook with only Total sheet. I know it's a simple problem, but somehow I just can not think of any elegant solution to this. Can anyone help?
Create the writer outside the function, using a with statement:
def multiple_dfs(df_list, sheets, writer, spaces):
    row = 0
    for dataframe in df_list:
        dataframe.to_excel(writer,sheet_name=sheets,startrow=row , startcol=0)
        row = row + len(dataframe.index) + spaces + 1
    # no writer.save() here: the with block below saves the file once, on exit
with pd.ExcelWriter('file_of_tables.xlsx') as writer:
    multiple_dfs(dfs_gfk, 'GFK', writer, 1)
    multiple_dfs(dfs_top, 'TOP', writer, 1)
    multiple_dfs(dfs_all, 'Total', writer, 1)
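The row bookkeeping used in multiple_dfs can be checked in isolation. The sketch below uses a hypothetical helper, start_rows, that takes only the number of data rows in each table and returns the startrow each one would get; the extra + 1 accounts for the header row that to_excel writes above each table.

```python
def start_rows(lengths, spaces):
    """Return the startrow for each table stacked on one sheet.

    lengths: number of data rows in each table, in order.
    spaces:  blank rows to leave between consecutive tables.
    """
    rows = []
    row = 0
    for n in lengths:
        rows.append(row)
        # n data rows + 1 header row + the requested blank rows
        row = row + n + spaces + 1
    return rows


print(start_rows([3, 2, 4], 1))
```

For tables of 3, 2 and 4 rows with one blank row in between, the tables start at rows 0, 5 and 9.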
From the pandas.ExcelWriter documentation:
You can also append to an existing Excel file:
>>> with ExcelWriter('path_to_file.xlsx', mode='a') as writer:
...     df.to_excel(writer, sheet_name='Sheet3')
The mode keyword matters when you're creating an instance of the ExcelWriter class.
mode='w' creates the file if it does not exist (it is written to disk once you call .save() or .close()) and overwrites it if it does.
mode='a' assumes there is an existing file and appends to that file. If you want to keep the structure of your code, you have to add a small line like so:
import pandas as pd
import os
def multiple_dfs(df_list, sheets, file_name, spaces):
    arg_mode = 'a' if os.path.isfile(file_name) else 'w' # line added: append only if the file already exists
    # added mode argument; openpyxl is used because the xlsxwriter engine does not support mode='a'
    writer = pd.ExcelWriter(file_name, engine='openpyxl', mode=arg_mode)
    row = 0
    for dataframe in df_list:
        dataframe.to_excel(writer,sheet_name=sheets,startrow=row , startcol=0)
        row = row + len(dataframe.index) + spaces + 1
    writer.close() # writes the workbook to disk
If you then run the following calls:
multiple_dfs(dfs_gfk, 'GFK', 'file_of_tables.xlsx', 1)
multiple_dfs(dfs_top, 'TOP', 'file_of_tables.xlsx', 1)
multiple_dfs(dfs_all, 'Total', 'file_of_tables.xlsx', 1)
the second and third function calls will not overwrite the data already written. Instead, the first call creates the file and the second and third calls append their sheets to it. Now your function should work.

Save Pandas DataFrames with formulas to xlsx files

In a Pandas DataFrame I have some "cells" with values and some that need to contain Excel formulas. I have read that I can get formulas with
link = 'HYPERLINK("#Groups!A' + str(someInt) + '"; "LINKTEXT")'
xlwt.Formula(link)
and store them in the dataframe.
When I try to save my dataframe as an xlsx file with
writer = pd.ExcelWriter("pandas" + str(fileCounter) + ".xlsx", engine = "xlsxwriter")
df.to_excel(writer, sheet_name = "Paths", index = False)
# insert more sheets here
writer.save()
I get the error:
TypeError: Unsupported type <class 'xlwt.ExcelFormula.Formula'> in write()
So I tried to write my formula as a string to my dataframe, but Excel wants to restore the file content and then fills all formula cells with 0's.
Edit: I managed to get it to work with regular strings, but nevertheless would be interested in a solution for xlwt formulas.
So my question is: How do I save dataframes with formulas to xlsx files?
Since you are using xlsxwriter, strings are parsed as formulas by default ("strings_to_formulas: Enable the worksheet.write() method to convert strings to formulas. The default is True"), so you can simply specify formulas as strings in your dataframe.
Example of a formula column which references other columns in your dataframe:
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
writer = pd.ExcelWriter("foo.xlsx", engine="xlsxwriter")
df["product"] = None
df["product"] = (
    '=INDIRECT("R[0]C[%s]", 0)+INDIRECT("R[0]C[%s]", 0)'
    % (
        df.columns.get_loc("col1") - df.columns.get_loc("product"),
        df.columns.get_loc("col2") - df.columns.get_loc("product"),
    )
)
df.to_excel(writer, index=False)
writer.save()
Produces the following output:
After writing the df using table.to_excel(writer, sheet_name=...), I use write_formula() as in this example (edited to add the full loop). To write all the formulas in your dataframe, read each formula in your dataframe.
# replace the right side below with reading the formula from your dataframe
# e.g., formula_to_write = df.loc[...]
rows = table.shape[0]
for row_num in range(1 + startrow, rows + startrow + 1):
    # row_num is zero-based for write_formula(); Excel row numbers inside the formula are one-based
    formula_to_write = '=I{} * (1 - AM{})'.format(row_num+1, row_num+1)
    worksheet.write_formula(row_num, col, formula_to_write)
Later in the code (I seem to recall one of these might be redundant, but I haven't looked it up):
writer.save()
workbook.close()
Documentation is here.
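The loop above mixes two numbering schemes: write_formula() takes zero-based row and column indexes, while the cell references inside the formula text (I2, AM2, ...) are one-based with letter columns. A small pure-Python sketch of that conversion follows; col_letters and cell_name are hypothetical helpers written for this illustration, not part of pandas or XlsxWriter.

```python
def col_letters(col):
    """Convert a zero-based column index to Excel column letters (0 -> 'A')."""
    letters = ""
    col += 1  # switch to one-based for the base-26 conversion
    while col:
        col, rem = divmod(col - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


def cell_name(row, col):
    """Convert zero-based (row, col) to an A1-style reference."""
    return col_letters(col) + str(row + 1)


# Zero-based row 1 is Excel row 2; zero-based column 8 is column I.
row_num = 1
formula = "={} * (1 - {})".format(cell_name(row_num, 8), cell_name(row_num, 38))
print(formula)
```

With helpers like these the formula text can be derived from the same zero-based indexes that are passed to the write call, instead of adding 1 by hand.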
You just need to save it as usual; keep in mind to write the formula as a string.
You can also use f-strings with variables.
writer = pd.ExcelWriter(FILE_PATH ,mode='a', if_sheet_exists='overlay')
col_Q_index = 3
best_formula = f'=max(L1,N98,Q{col_Q_index})'
formula_df = pd.DataFrame([[best_formula]])
formula_df.to_excel(writer, sheet_name=SHEET_NAME, startrow=i, startcol=17, index=False, header=False)
writer.save()
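The answers in this thread come down to storing the formula text in the dataframe cells. As a minimal sketch of that idea, assuming the frame is written with index=False starting at the top-left of the sheet (so the header lands in Excel row 1, col1 in column A and col2 in column B), a per-row formula column can be built like this. The column name total is made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

# Data row i (zero-based) ends up in Excel row i + 2, because row 1
# holds the header when the frame is written at startrow=0.
df["total"] = ["=A{0}+B{0}".format(i + 2) for i in range(len(df))]

print(df["total"].tolist())
```

Writing this frame with the xlsxwriter engine, e.g. df.to_excel(writer, index=False), should then store the strings as formulas, since that engine converts strings starting with "=" by default, as quoted earlier in the thread.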

How to select multiple columns (but same rows) of xlsx file while looping using Openpyxl?

I have an excel file that looks like this (example)
[Balance Sheet][1]
[1]: https://i.stack.imgur.com/O0WXP.jpg
I would like to extract all the items of this financial statement and write them to a new Excel sheet. The output I want has all accounts in one column and all the corresponding numbers in another column.
[Intended output][2]
[2]: https://i.stack.imgur.com/nbTtR.jpg
My code so far is:
import openpyxl
fwb=openpyxl.load_workbook('wb.xlsx')
sheet=fwb['Sheet1']
sheet['A9']
for i in range(9,sheet.max_row,1):
    items=sheet.cell(row=i, column=1).value
    number1=sheet.cell(row=i, column=3).value
    number2=sheet.cell(row=i, column=4).value
    print(items, number1, number2)
My issue is that I want the list of items to be in one column, just like the intended output. Hence I would ideally want something like items=sheet.cell(row=i, column=1 AND 2).
In openpyxl this is very straightforward:
# ws1 is your source worksheet
# ws2 is your target worksheet
for row in ws1['A':'B']:
    ws2.append((c.value for c in row))
for row in ws1['C':'D']:
    ws2.append((c.value for c in row))
Adjust the columns as you need them
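The stacking itself can be sketched without a workbook. The example below assumes the sheet has been read into plain 4-tuples of (left item, left value, right item, right value), which is one way the data might look if you read rows with something like ws1.iter_rows(min_col=1, max_col=4, values_only=True). The helper stack_pairs and the sample rows are made up for illustration.

```python
def stack_pairs(rows):
    """Stack the right-hand (item, value) pairs underneath the left-hand ones.

    rows: iterable of 4-tuples (item_left, value_left, item_right, value_right).
    Pairs where both cells are empty (None) are dropped.
    """
    rows = list(rows)
    left = [(r[0], r[1]) for r in rows]
    right = [(r[2], r[3]) for r in rows]
    return [pair for pair in left + right if any(v is not None for v in pair)]


# Made-up sample rows; the second row has nothing on the right-hand side.
rows = [
    ("Cash", 100, "Payables", 40),
    ("Inventory", 60, None, None),
]
for pair in stack_pairs(rows):
    print(pair)
```

Each resulting pair could then be passed to ws2.append(pair) to write it as one row of the target sheet.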
I will guess the structure of your worksheet from the code, since you did not specify which ranges contain which data.
Something like this may work for you.
You probably need to adjust some values with +/-1, depending on headers, etc.
row_base1=len(sheet['A'])
nrows2=len(sheet['C'])-9
for i in range(1,nrows2):
    row1=row_base1+i
    row2=8+i
    number1=sheet.cell(row=row2, column=3).value
    number2=sheet.cell(row=row2, column=4).value
    sheet.cell(row=row1, column=1).value=number1
    sheet.cell(row=row1, column=2).value=number2
    print(number1, number2)
nrows2 might give a number larger than what you actually need, see this.
In that case, you will have to add some detection method inside the loop.
Here's my approach using lambda.
Index using numbers:
column = lambda x: sheet[chr(ord('@') + x) + str(i)].value
for i in range(1, sheet.max_row + 1):
    print(column(1), column(3), column(4))
Index using letters:
column = lambda x: sheet[x + str(i)].value
for i in range(1, sheet.max_row + 1):
    print(column('A'), column('C'), column('D'))
You might try pandas, as follows. The result can be saved to an Excel file if you want. Run pip install openpyxl first (pandas reads .xlsx files through openpyxl; older setups used xlrd for this).
import pandas as pd
book1 = pd.ExcelFile('book1.xlsx')
df = pd.read_excel(book1, 'Sheet1')
cols = ['Item', 'Value']
x = df.drop(df.columns[2:], axis=1)
y = df.drop(df.columns[:2], axis=1)
x.columns = cols
y.columns = cols
df2 = pd.concat([x, y], ignore_index=True)
df2.dropna(how='all', inplace=True)
print(df2)
Result1
You can also do this:
df2['Index'] = df2.loc[df2['Value'].isnull(), 'Item']
df2['Index'] = df2['Index'].ffill()
df3 = df2.set_index(['Index', 'Item']).dropna()
print(df3)
Result2

How to output my dictionary to an Excel sheet in Python

Background: My first Excel-related script. Using openpyxl.
There is an Excel sheet with loads of different types of data about products in different columns.
My goal is to extract certain types of data from certain columns (e.g. price, barcode, status), assign those to the unique product code and then output product code, price, barcode and status to a new excel doc.
I have succeeded in extracting the data and putting it into the following dictionary format:
productData = {'AB123': {'barcode': 123456, 'price': 50, 'status': 'NEW'}}
My general thinking on getting this output to a new report is something like this (although I know that this is wrong):
newReport = openpyxl.Workbook()
newSheet = newReport.active
newSheet.title = 'Output'
newSheet['A1'].value = 'Product Code'
newSheet['B1'].value = 'Price'
newSheet['C1'].value = 'Barcode'
newSheet['D1'].value = 'Status'
for row in range(2, len(productData) + 1):
    newSheet['A' + str(row)].value = productData[productCode]
    newSheet['B' + str(row)].value = productPrice
    newSheet['C' + str(row)].value = productBarcode
    newSheet['D' + str(row)].value = productStatus
newReport.save('ihopethisworks.xlsx')
What do I actually need to do to output the data?
I would suggest using Pandas for that. It has the following syntax:
df = pd.read_excel('your_file.xlsx')
df['Column name you want'].to_excel('new_file.xlsx')
You can do a lot more with it. Openpyxl might not be the right tool for your task (Openpyxl is too general).
P.S. I would leave this in the comments, but Stack Overflow, in their wisdom, decided to let anyone leave answers but not comments.
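For the nested dictionary from the question, the pandas route might look like the sketch below. It uses DataFrame.from_dict with orient="index" so each product code becomes a row; the second product and the final column order are made up for illustration.

```python
import pandas as pd

# First entry is from the question; the second is a made-up extra product.
product_data = {
    "AB123": {"barcode": 123456, "price": 50, "status": "NEW"},
    "CD456": {"barcode": 654321, "price": 75, "status": "OLD"},
}

df = (
    pd.DataFrame.from_dict(product_data, orient="index")
    .rename_axis("Product Code")  # name the index so it becomes a column
    .reset_index()
)
# Reorder the columns to match the report layout in the question.
df = df[["Product Code", "price", "barcode", "status"]]
print(df)
```

The frame can then be exported with df.to_excel(file_name, index=False), where file_name is whatever output path you choose.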
The logic you use to extract the data is missing, but I suspect the best approach is to use it to loop over the two worksheets in parallel. You can then avoid using a dictionary entirely and just append rows to the new worksheet.
Pseudocode:
ws1 # source worksheet
ws2 # new worksheet
product = []
code = ws1[…] # some lookup
barcode = ws1[…]
price = ws1[…]
status = ws1[…]
ws2.append([code, price, barcode, status])
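If the data is already in the nested dictionary from the question, the rows can also be built directly from it. The sketch below only prepares the rows as plain lists, so it runs without openpyxl; the second product is a made-up extra entry.

```python
# First entry is from the question; the second is a made-up extra product.
product_data = {
    "AB123": {"barcode": 123456, "price": 50, "status": "NEW"},
    "CD456": {"barcode": 654321, "price": 75, "status": "OLD"},
}

# One header row, then one row per product code.
rows = [["Product Code", "Price", "Barcode", "Status"]]
for code, info in product_data.items():
    rows.append([code, info["price"], info["barcode"], info["status"]])

for row in rows:
    print(row)
```

Each list can then be handed to the worksheet with ws2.append(row), which writes it as the next row, so no manual cell addresses such as 'A' + str(row) are needed.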
Pandas will work best for this.
Here are some examples:
import pandas as pd
#df columns: Date Open High Low Close Volume
#reading data from an excel
df = pd.read_excel('GOOG-NYSE_SPY.xls')
#set index to the column of your choice, in this case it would be date
df.set_index('Date', inplace = True)
#choosing the columns of your choice for further manipulation
df = df[['Open', 'Close']]
#take the difference of the two columns relative to Close to get the % change
df = (df['Open'] - df['Close']) / df['Close'] * 100
print(df.head())
