I am using pandas/openpyxl to process an excel file and then create a pivot table to add to a new worksheet in the current workbook. When I execute my code, the new sheet gets created but the pivot table does not get added to the sheet.
Here is my code:
worksheet2 = workbook.create_sheet()
worksheet2.title = 'Sheet1'
workbook.save(filename)
excel = pd.ExcelFile(filename)
df = excel.parse(sheetname=0)
df1 = df[['Product Description', 'Supervisor']]
table1 = pd.pivot_table(df1, index = ['Supervisor'],
                        columns = ['Product Description'],
                        values = ['Product Description'],
                        aggfunc = [lambda x: len(x)], fill_value = 0)
print table1
writer = pd.ExcelWriter(filename)
table1.to_excel(writer, 'Sheet1')
writer.save()
workbook.save(filename)
When I print out my table I get this:
                                <lambda>                         \
Product Description  EXPRESS 10:30 (doc)  EXPRESS 10:30 (nondoc)
Supervisor
Building                               0                       1
Gordon                                 1                       0
Pete                                   0                       0
Vinny A                                0                       1
Vinny P                                0                       1

                                                                 \
Product Description  EXPRESS 12:00 (doc)  EXPRESS 12:00 (nondoc)
Supervisor
Building                               0                       4
Gordon                                 1                       2
Pete                                   1                       0
Vinny A                                1                       1
Vinny P                                0                       1

Product Description  MEDICAL EXPRESS (nondoc)
Supervisor
Building                                    0
Gordon                                      1
Pete                                        0
Vinny A                                     0
Vinny P                                     0
I would like the pivot table to look like this: (if my pivot table code won't make it look like this could someone help me make it look like that? I'm not sure how to add the grand total column. It has something to do with the aggfunc portion of the pivot table right?)
You can't do this because openpyxl does not currently support pivot tables. See https://bitbucket.org/openpyxl/openpyxl/issues/295 for further information.
Since pd.pivot_table returns a dataframe, you can just write the dataframe into excel.
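For the grand total asked about in the question, one option is to let pandas compute the margins instead of counting with a lambda. This is only a minimal sketch: it swaps in pd.crosstab (a counting shortcut, not the pivot_table call from the question), the sample rows are made up, and the "Grand Total" label, the append_table helper and its "Pivot" sheet name are assumptions for illustration.

```python
import pandas as pd

# Made-up rows standing in for the 'Supervisor' / 'Product Description' columns.
df = pd.DataFrame({
    "Supervisor": ["Gordon", "Gordon", "Pete", "Building"],
    "Product Description": [
        "EXPRESS 10:30 (doc)",
        "EXPRESS 12:00 (doc)",
        "EXPRESS 12:00 (doc)",
        "EXPRESS 12:00 (nondoc)",
    ],
})

# Count rows per supervisor/product pair; margins=True adds a total row and column.
table = pd.crosstab(
    df["Supervisor"],
    df["Product Description"],
    margins=True,
    margins_name="Grand Total",
)
print(table)


def append_table(table, path, sheet_name="Pivot"):
    """Hypothetical helper: append the table as a new sheet of an existing workbook.

    Requires openpyxl; the sheet is written as plain cells, not as an Excel pivot table.
    """
    with pd.ExcelWriter(path, mode="a", engine="openpyxl") as writer:
        table.to_excel(writer, sheet_name=sheet_name)
```

Calling append_table(table, filename) after the workbook has been saved should add the counts, including the totals, as an ordinary sheet.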
Here is how I write my output from a pandas dataframe to an excel template.
Please note that if data is already present in the cells where you are trying to write the dataframe, it will not be overwritten and the dataframe will be written to a new sheet, which is why I have included a step to clear existing data from the template. I have not tried to write output to merged cells, so that might throw an error.
Setup
from openpyxl import load_workbook
from openpyxl.utils.dataframe import dataframe_to_rows
file_path = 'Template.xlsx'
book = load_workbook(file_path)
sheet_name = "Template 1"
sheet = book[sheet_name]
Set first row and first column in the excel template where output is to be pasted.
If my output is to be pasted starting in cell N2, row_start will be 2 and col_start will be 14
row_start = 2
col_start = 14
Clear existing data in excel template
for c_idx, col in enumerate(df.columns, col_start):
    for r_idx in range(row_start, 10001):
        sheet.cell(row=r_idx, column=c_idx, value="")
Write dataframe to excel template
rows = dataframe_to_rows(df, index=False)
for r_idx, row in enumerate(rows, row_start):
    for c_idx, col in enumerate(row, col_start):
        sheet.cell(row=r_idx, column=c_idx, value=col)
Save the workbook
book.save(file_path)
context:
I have a dataframe that I want to export into an Excel workbook. I want the workbook to have an additional column (one that is not in the dataframe) with data validation applied, so that it only accepts values in the range 1-9.
The dataframe looks something like this:
name     year
trixie   1985
timmy    1990
chester  1993
I want the exported excel sheet to look like this: where the code column only allows a number between 1 and 9 (the excel data validation way) I want to do all of this in python.
name     year  code
trixie   1985
timmy    1990
chester  1993
Please help. THANKS in advance!
I would use pandas.ExcelWriter with worksheet.data_validation from xlsxwriter:
df["code"] = None
items = list(range(1,10))
max_row, max_col = df.shape
with pd.ExcelWriter("/tmp/file.xlsx", engine="xlsxwriter") as writer:
    df.to_excel(writer, index=False, sheet_name="Sheet1", startrow=0)
    wb = writer.book
    ws = writer.sheets["Sheet1"]
    ws.data_validation(f"C2:C{max_row+2}", {"validate": "list", "source": items})
Output :
Just assign a new column before exporting to excel:
df.assign(code='').to_excel('file.xlsx', index=False)
It's not possible to apply some constraints on the columns (without pain) in Python. Maybe you can use a macro or openpyxl to write an xlsm file.
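For the openpyxl route, here is a minimal sketch of a whole-number rule between 1 and 9 on the code column. It assumes the three sample rows from the question, written directly with ws.append rather than from a dataframe, and the output file name in the commented-out save call is hypothetical.

```python
from openpyxl import Workbook
from openpyxl.worksheet.datavalidation import DataValidation

# Sample rows from the question, with an empty 'code' column added.
rows = [
    ("name", "year", "code"),
    ("trixie", 1985, None),
    ("timmy", 1990, None),
    ("chester", 1993, None),
]

wb = Workbook()
ws = wb.active
for row in rows:
    ws.append(row)

# Allow only whole numbers from 1 to 9 (blank cells stay valid).
dv = DataValidation(
    type="whole",
    operator="between",
    formula1="1",
    formula2="9",
    allow_blank=True,
)
ws.add_data_validation(dv)

# Apply the rule to the data rows of column C (the header is in row 1).
dv.add(f"C2:C{ws.max_row}")

# wb.save("file.xlsx")  # hypothetical output path
```

Excel enforces the rule when a user types into those cells; values written from Python are not checked.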
I have code that converts a txt file to xlsx and then adds a column with formulas. After that, I want to create a pivot table from that information in a different sheet. The code runs without errors, but it creates an empty sheet instead of a sheet with the information.
So the code looks like this:
import numpy as np
import openpyxl
import pandas as pd
#Transforming our txt to xlsx
path = r"C:\Users\roslber\Desktop\Codes\Python\Projects\Automated routes.xlsx"
rssdata= pd.read_csv("dwp.txt", sep="\t")
rssdata.to_excel(path, index= None , header= True)
#Writing the formula column
wb = openpyxl.load_workbook(filename=path)
ws1 = wb["Sheet1"]
ws1["AC1"] = "CF Weight"
row_count= ws1.max_row
actual_row= 2
while actual_row <= row_count: #writing the formula in every row
    r= str(actual_row)
    ws1["AC"+r] = "=(O"+r + "*P"+r +"*Q"+r +")/28316.8"
    actual_row= actual_row + 1
#Creating a new sheet with the pivot tables
df = pd.read_excel(path, 0, header= 0) #defining pivot table dataframe
wb.create_sheet("Sheet2")
pv_pack = pd.pivot_table(df, values=["actual_service_time"],\
    index=["delivery_station_code"], columns=["cluster_prefix"], aggfunc=np.sum) #constructing the pivot table
print(pv_pack)
with pd.ExcelWriter(path, mode="a") as writer:
    pv_pack.to_excel(writer, sheet_name="Sheet2")
    writer.save() #inserting pivot table in sheet2
wb.save(path)
For data protection reasons I can't show you the information inside the pivot table, but when I print it I can see exactly what I want. The problem is that, although Sheet2 is created correctly, the information that I can see printed doesn't appear in Sheet2. Why is this happening?
I have checked these questions:
Trouble writing pivot table to excel file
How to save a new sheet in an existing excel file, using Pandas?
Regarding the first one, apparently openpyxl can't create a Pivot Table, but I don't actually need the Pivot Table format. I just need the pv_pack information in Sheet2 as it is shown when I print it.
I tried to change my code to imitate what they did in the second question but it didn't work.
Thank you in advance
Edit answering to RJ Adriaansen:
The information in Sheet1 would look like this:
id order mtd delivery_station_code cluster_prefix actual_service_time
xh aabb1 one 1 One_ 231
xr aabb2 two 2 Two_ 135
xd aabb3 three 3 One_ 80
xh aabb8 two 1 Two_ 205
xp aabb9 three 2 One_ 1
xl aabb10 one 3 Two_ 115
And the code printed in my editor looks like this:
delivery_station_code One_ Two_
1 231 205
2 1 135
3 80 115
The with statement saves and closes the file automatically, so there is no need to save it manually. It is also unnecessary to create the second sheet before writing to it. Removing writer.save() and moving wb.save(path) up, before the file is read back with pandas, will make the code work.
#Writing the formula column
wb = openpyxl.load_workbook(filename=path)
ws1 = wb["Sheet1"]
ws1["AC1"] = "CF Weight"
row_count= ws1.max_row
actual_row= 2
while actual_row <= row_count: #writing the formula in every row
    r= str(actual_row)
    ws1["AC"+r] = "=(O"+r + "*P"+r +"*Q"+r +")/28316.8"
    actual_row= actual_row + 1
wb.save(path)
#Creating a new sheet with the pivot tables
df = pd.read_excel(path, 0, header= 0) #defining pivot table dataframe
pv_pack = pd.pivot_table(df, values=["actual_service_time"],\
    index=["delivery_station_code"], columns=["cluster_prefix"], aggfunc=np.sum) #constructing the pivot table
with pd.ExcelWriter(path, mode="a") as writer:
    pv_pack.to_excel(writer, sheet_name="Sheet2")
I have a Python Script that extracts a specific column from an Excel .xls file, but the output has a numbering next to the extracted information, so I would like to know how to format the output so that they don't appear.
My actual code is this:
for i in sys.argv:
    file_name = sys.argv[1]
workbook = pd.read_excel(file_name)
df = pd.DataFrame(workbook, columns=['NOM_LOGR_COMPLETO'])
df = df.drop_duplicates()
df = df.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
print(df)
My current output:
1 Street Alpha
2 Street Bravo
But the result I need is:
Street Alpha
Street Bravo
without the numbering, just the name of the streets.
Thanks!
I believe you want to have a dataframe without the index. Note that you cannot have a DataFrame without the indexes, they are the whole point of the DataFrame. So for your case, you can adopt:
print(df.values)
to see the dataframe without the index column. To save the output without index, use:
with pd.ExcelWriter("dataframe.xlsx", engine='xlsxwriter') as writer:
    df.to_excel(writer, sheet_name="Sheet1", index=False)
where "dataframe.xlsx" is the name of the output file and "Sheet1" is the name of the sheet to write to.
Further references can be found at:
How to print pandas DataFrame without index
Printing a pandas dataframe without row number/index
disable index pandas data frame
Python to_excel without row names (index)?
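If the goal is only to print the street names, another option is to work with the column as a Series and print its values. This is a small sketch with made-up data in place of the Excel file; only the column name NOM_LOGR_COMPLETO comes from the question.

```python
import pandas as pd

# Made-up data standing in for the result of pd.read_excel(file_name).
df = pd.DataFrame(
    {"NOM_LOGR_COMPLETO": ["Street Alpha", "Street Bravo", "Street Alpha", None]}
)

# Keep each street once and drop empty cells.
streets = df["NOM_LOGR_COMPLETO"].drop_duplicates().dropna()

# Option 1: let pandas format the column without the index.
print(streets.to_string(index=False))

# Option 2: join the plain values, one street per line.
joined = "\n".join(streets)
print(joined)
```

The second option avoids any padding that pandas may add when it formats the column.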
I receive some Excel files like this:
      USA          UK
      plane  cars  plane  cars
2016  2      7     1      3     # a comment after the last country
2017  3      1     8      4
There is an unknown amount of countries and there can be a comment after the last column.
When I read the Excel file like that...
df = pd.read_excel(
    sourceFilePath,
    sheet_name = 'Sheet1',
    index_col = [0],
    header = [0, 1]
)
... I have a value error :
ValueError: Length of new names must be 1, got 2
The problem is that I cannot use the usecols parameter, because I don't know how many countries there are before reading my file.
How can I read such a file ?
It's possible Pandas won't be able to fix your special use case, but you can write a program that fixes the spreadsheet using openpyxl. It has really clear documentation, but here's an overview of how to use it:
import openpyxl as xl
wb = xl.load_workbook("ExampleSheet.xlsx")
for sheet in wb.worksheets:
    print("Sheet Title => {}".format(sheet.title))
    print("Dimensions => {}".format(sheet.dimensions)) # just returns a string
    print("Columns: {} <-> {}".format(sheet.min_column, sheet.max_column))
    print("Rows: {} <-> {}".format(sheet.min_row, sheet.max_row))
    for r in range(sheet.min_row, sheet.max_row + 1):
        for c in range(sheet.min_column, sheet.max_column + 1):
            if sheet.cell(r,c).value is not None:
                print("Cell {}:{} has value {}".format(r,c,sheet.cell(r,c).value))
What about just using pd.read_csv?
Once loaded, you can then determine how many columns you have with df.columns.
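Another way to approach this is to read the sheet with no header at all and build the two-level header by hand, dropping any column that has no second-level label (which is where the trailing comment ends up). The sketch below is an assumption-heavy illustration: the raw frame is typed in by hand to mimic what pd.read_excel(sourceFilePath, header=None) might return for the sample in the question, and tidy is a hypothetical helper name.

```python
import pandas as pd

# Hand-typed stand-in for pd.read_excel(sourceFilePath, header=None).
raw = pd.DataFrame([
    [None, "USA", None, "UK", None, None],
    [None, "plane", "cars", "plane", "cars", None],
    [2016, 2, 7, 1, 3, "a comment after the last country"],
    [2017, 3, 1, 8, 4, None],
])


def tidy(raw):
    """Build a two-level column header and drop columns without a sub-header."""
    top = raw.iloc[0, 1:].ffill()   # repeat each country over its columns
    sub = raw.iloc[1, 1:]           # plane / cars labels
    keep = sub.notna().to_numpy()   # comment columns have no sub-header
    data = raw.iloc[2:, 1:].loc[:, keep].copy()
    data.columns = pd.MultiIndex.from_arrays(
        [top[keep].tolist(), sub[keep].tolist()]
    )
    data.index = raw.iloc[2:, 0].tolist()  # first column holds the years
    return data


table = tidy(raw)
print(table)
```

Because the columns to keep are worked out from the header rows themselves, the number of countries does not need to be known in advance.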
I want to read the data in one column in excel, here is my code:
import xlrd
file_location = "location/file_name.xlsx"
workbook = xlrd.open_workbook(file_location)
sheet = workbook.sheet_by_name('sheet')
x = []
for cell in sheet.col[9]:
    if isinstance(cell, float):
        x.append(cell)
print(x)
It is wrong because there is no method in sheet called col[col.num], but I just want to extract the data from column 8 (column H), what can I do?
If you're not locked with xlrd I would probably have used pandas instead which is pretty good when working with data from anywhere:
import pandas as pd
df = pd.ExcelFile('location/test.xlsx').parse('Sheet1') #you could add index_col=0 if there's an index
x=[]
x.append(df['name_of_col'])
You could then just write the new extracted columns to a new excel file with pandas df.to_excel()
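If the column name is not known, pandas can also select the column by position. A small sketch, assuming a made-up eight-column frame in place of the parsed sheet, so that position 7 corresponds to column H:

```python
import pandas as pd

# Made-up frame standing in for the parsed sheet; column 'H' is the eighth column.
df = pd.DataFrame(
    [
        [0, 0, 0, 0, 0, 0, 0, 1.5],
        [0, 0, 0, 0, 0, 0, 0, "n/a"],
        [0, 0, 0, 0, 0, 0, 0, 2.5],
    ],
    columns=list("ABCDEFGH"),
)

# Positions start at 0, so the eighth column is at position 7.
column_h = df.iloc[:, 7]

# Keep only the float values, as in the question.
x = [value for value in column_h if isinstance(value, float)]
print(x)
```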
You can get the values of the 8th column like this:
for rownum in range(sheet.nrows):
    x.append(sheet.cell_value(rownum, 7))
By far the easiest way to get all the values in a column using xlrd is the col_values() worksheet method:
x = []
for value in sheet.col_values(8):
    if isinstance(value, float):
        x.append(value)
(Note that if you want column H, you should use 7, because the indices start at 0.)
Incidentally, you can use col() to get the cell objects in a column:
x = []
for cell in sheet.col(8):
    if isinstance(cell.value, float):
        x.append(cell.value)
The best place to find this stuff is the official tutorial (which serves as a decent reference for xlrd, xlwt, and xlutils). You could of course also check out the documentation and the source code.
I would recommend to do it as:
import openpyxl
fname = 'file.xlsx'
wb = openpyxl.load_workbook(fname)
sheet = wb['sheet-name']
for rowOfCellObjects in sheet['C5':'C7']:
    for cellObj in rowOfCellObjects:
        print(cellObj.coordinate, cellObj.value)
Result: C5 70.82 C6 84.82 C7 96.82
Note: fname refers to the Excel file, wb['sheet-name'] selects the desired sheet, and sheet['C5':'C7'] gives the range of cells to read from the column.
Check out the link for more detail. Code segment taken from here too.
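To read a whole column with openpyxl rather than a fixed range, the worksheet can be indexed by the column letter. A minimal sketch, using an in-memory workbook with made-up values instead of the file from the question:

```python
from openpyxl import Workbook

# In-memory workbook with made-up values in column H (the eighth column).
wb = Workbook()
ws = wb.active
for row_number, value in enumerate([1.5, "text", 2.5], start=1):
    ws.cell(row=row_number, column=8, value=value)

# ws["H"] returns the cells of column H; keep only the float values.
x = [cell.value for cell in ws["H"] if isinstance(cell.value, float)]
print(x)
```

With a real file, wb would come from openpyxl.load_workbook(fname) instead of Workbook().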
XLRD is good, but for this case you might find Pandas good because it has routines to select columns by using an operator '[ ]'
Complete Working code for your context would be
import pandas as pd
file_location = "file_name.xlsx"
sheet = pd.read_excel(file_location)
print(sheet['Sl'])
Output 1 - For column 'Sl'
0 1
1 2
2 3
Name: Sl, dtype: int64
Output 2 - For column 'Name'
print(sheet['Name'])
0 John
1 Mark
2 Albert
Name: Name, dtype: object
Reference: file_name.xlsx data
Sl Name
1 John
2 Mark
3 Albert