I have been working on automating a series of reports in Python. I have been trying to create a series of pivot tables from an imported CSV (binlift.csv). I have found the pandas library very useful for this; however, I can't seem to find anything that helps me write the pandas-created pivot tables to my Excel document (Template.xlsx), and was wondering if anyone can help. So far I have written the following code:
import openpyxl
import csv
import datetime
import pandas as pd
import numpy as np
file1 = "Template.xlsx" # template file
file2 = "binlift.csv" # raw data csv
wb1 = openpyxl.load_workbook(file1) # opens template
ws1 = wb1.create_sheet("Raw Data") # create a new sheet in template called Raw Data
summary = wb1.worksheets[0] # variables given to sheets for manipulation
rawdata = wb1.worksheets[1]
headings = ["READER","BEATID","LIFTYEAR","LIFTMONTH","LIFTWEEK","LIFTDAY","TAGGED","UNTAGGEDLIFT","LIFT"]
df = pd.read_csv(file2, names=headings)
pivot_1 = pd.pivot_table(df, index=["LIFTYEAR", "LIFTMONTH","LIFTWEEK"], values=["TAGGED","UNTAGGEDLIFT","LIFT"],aggfunc=np.sum)
pivot_2 = pd.pivot_table(df, index=["LIFTYEAR", "LIFTMONTH"], values=["TAGGED","UNTAGGEDLIFT"],aggfunc=np.sum)
pivot_3 = pd.pivot_table(df, index=["READER"], values=["TAGGED","UNTAGGEDLIFT","LIFT"],aggfunc=np.sum)
print(pivot_1)
print(pivot_2)
print(pivot_3)
wb1.save('test.xlsx')
There is an option in pandas to write .xlsx files: pd.ExcelWriter together with DataFrame.to_excel.
Here basically we get all the indices (at level 0) of the pivot table, and then one by one we go over these indices to subset the table and write that part of the table.
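writer = pd.ExcelWriter('output.xlsx')
for manager in pivot_1.index.get_level_values(0).unique():
    # select the rows for this top-level index value
    temp_df = pivot_1.xs(manager, level=0)
    # sheet names must be strings, and the index values here are numeric years
    temp_df.to_excel(writer, sheet_name=str(manager))
# closing the writer saves the file (writer.save() was removed in pandas 2.0)
writer.close()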
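To make the subsetting step concrete, here is a small sketch of what `xs(..., level=0)` returns for each top-level index value, without writing anything to disk. The sample rows are made up for illustration; only the column names come from the question.

```python
import pandas as pd

# Made-up sample rows; only the column names come from the question.
df = pd.DataFrame({
    "LIFTYEAR": [2016, 2016, 2017, 2017],
    "LIFTMONTH": [11, 12, 1, 1],
    "LIFTWEEK": [45, 50, 1, 2],
    "TAGGED": [10, 20, 30, 40],
    "LIFT": [12, 25, 33, 41],
})

pivot = pd.pivot_table(
    df,
    index=["LIFTYEAR", "LIFTMONTH", "LIFTWEEK"],
    values=["TAGGED", "LIFT"],
    aggfunc="sum",
)

# One sub-table per distinct value of the outermost index level (LIFTYEAR).
subsets = {}
for year in pivot.index.get_level_values(0).unique():
    # xs drops the selected level, leaving LIFTMONTH/LIFTWEEK as the index.
    subsets[str(year)] = pivot.xs(year, level=0)

for name, table in subsets.items():
    print(name)
    print(table)
```

Each entry in `subsets` is what would end up on its own sheet, keyed by the string that would be used as the sheet name.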
Related
I have the following code, where I want to read data from the first sheet of an Excel file and then, according to some category, write each category to a separate sheet.
The program runs without an error, but all the sheets it produces are empty.
import pandas
import os
from openpyxl import load_workbook
import pandas as pd
import xlsxwriter
path = r"C:\Users\acer pc\Desktop\rrrr.xlsx"
os.chdir(r"C:\Users\acer pc\Desktop")
data = pandas.read_excel("rrrr.xlsx")
FileNumber = data["number"].unique()
print(FileNumber)
wb2 = load_workbook('rrrr.xlsx')
for i in FileNumber:
    wb2.create_sheet(f'{i}')
wb2.save(r"C:\Users\acer pc\Desktop\rrrr.xlsx")
for i in FileNumber:
    rslt_df = data[data['number'] == i]
    print(rslt_df)
    writer = pd.ExcelWriter(r"C:\Users\acer pc\Desktop\rrrr.xlsx", engine='xlsxwriter')
    rslt_df.to_excel(writer, sheet_name=f'{i}', index=False)
wb2.save(r"C:\Users\acer pc\Desktop\rrrr.xlsx")
wb2.close()
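A likely cause of the empty sheets is that each ExcelWriter created inside the loop is never closed, so nothing it holds reaches the file, while the final wb2.save() writes out the openpyxl workbook that only contains the blank sheets. The sketch below is one possible way around this, not the only one: it splits the frame with groupby and writes every group through a single writer. The `number` column comes from the question; the helper names, sample rows and output path are hypothetical.

```python
import pandas as pd


def split_by_column(data, column):
    """Return {sheet_name: sub-frame} with one entry per distinct value."""
    return {str(key): group for key, group in data.groupby(column)}


def write_sheets(frames, path):
    """Write each frame to its own sheet using a single ExcelWriter."""
    # The context manager closes the writer, which is what saves the file.
    with pd.ExcelWriter(path) as writer:
        for sheet_name, frame in frames.items():
            frame.to_excel(writer, sheet_name=sheet_name, index=False)


# Made-up rows standing in for the contents of rrrr.xlsx.
data = pd.DataFrame({"number": [1, 1, 2, 3], "value": ["a", "b", "c", "d"]})
frames = split_by_column(data, "number")
print({name: len(frame) for name, frame in frames.items()})

# Hypothetical output path; needs an Excel engine such as openpyxl or xlsxwriter.
# write_sheets(frames, "split_output.xlsx")
```

Writing to a new file, rather than back into the workbook that was just read, also avoids overwriting the source data if something goes wrong.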
Running dataframe.to_excel() automatically saves the dataframe as the last sheet in the Excel file.
Is there a way to save a dataframe as the very first sheet, so that, when you open the spreadsheet, Excel shows it as the first on the left?
The only workaround I have found is to first export an empty dataframe to the tab with the name I want as first, then export the others, then export the real dataframe I want to the tab with the name I want. Example in the code below. Is there a more elegant way? More generically, is there a way to specifically choose the position of the sheet you are exporting to (first, third, etc)?
Of course, this arises because the dataframe I want as first is the result of some calculations based on all the others, so I cannot export it first.
import pandas as pd
import numpy as np
writer = pd.ExcelWriter('My excel test.xlsx')
first_df = pd.DataFrame()
first_df['x'] = np.arange(0,100)
first_df['y'] = 2 * first_df['x']
other_df = pd.DataFrame()
other_df['z'] = np.arange(100,201)
pd.DataFrame().to_excel(writer, sheet_name='this should be the 1st')
other_df.to_excel(writer, sheet_name='other df')
first_df.to_excel(writer, sheet_name='this should be the 1st')
writer.close()
It is possible to re-arrange the sheets after they have been created:
import pandas as pd
import numpy as np
writer = pd.ExcelWriter('My excel test.xlsx', engine='xlsxwriter')
first_df = pd.DataFrame()
first_df['x'] = np.arange(0,100)
first_df['y'] = 2 * first_df['x']
other_df = pd.DataFrame()
other_df['z'] = np.arange(100,201)
other_df.to_excel(writer, sheet_name='Sheet2')
first_df.to_excel(writer, sheet_name='Sheet1')
writer.close()  # saves the file (writer.save() was removed in pandas 2.0)
This writes the sheets in the order Sheet2, Sheet1.
To reorder them, add this before you save the workbook (it relies on the xlsxwriter engine's workbook object):
workbook = writer.book
workbook.worksheets_objs.sort(key=lambda x: x.name)
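An alternative that avoids touching the writer's internals, sketched here under the assumption that all frames fit in memory: build every DataFrame first, then write them in the order the sheets should appear. The helper names, sheet names and the summary calculation are placeholders, not part of the original answer.

```python
import numpy as np
import pandas as pd


def ordered_frames(frames, first):
    """Return (name, frame) pairs with the sheet called `first` moved to the front."""
    names = [first] + [name for name in frames if name != first]
    return [(name, frames[name]) for name in names]


def write_in_order(frames, first, path):
    """Write the frames to `path`; sheets appear in the order they are written."""
    with pd.ExcelWriter(path) as writer:
        for name, frame in ordered_frames(frames, first):
            frame.to_excel(writer, sheet_name=name)


# Build the detail sheet first, then a summary that depends on it.
other_df = pd.DataFrame({"z": np.arange(100, 201)})
summary_df = pd.DataFrame({"total_z": [other_df["z"].sum()]})  # placeholder calculation

frames = {"other df": other_df, "summary": summary_df}
print([name for name, _ in ordered_frames(frames, "summary")])

# Hypothetical output file; requires an Excel engine such as openpyxl or xlsxwriter.
# write_in_order(frames, "summary", "ordered_output.xlsx")
```

Because nothing is written until every frame exists, the summary can depend on all the other frames and still be written first.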
I am writing a PySpark program that takes a text file and adds a few columns to the left (the beginning) of the columns in the file.
My text file looks like this:
ID,Name,Age
1233,James,15
After I run the program, I want it to add two columns named Creation_DT and Created_By to the left of the table. I am trying to get it to look like this:
Creation_DT,Created_By,ID,Name,Age
"current timestamp", Sean,1233,James,15
The code below gets my required output, but I was wondering if there is an easier way to do this, or a way to optimize my script using PySpark.
import pandas as pd
df = pd.read_csv("/home/path/Sample Text Files/sample5.txt", delimiter=",")
# insert the new columns at positions 0 and 1 so they appear on the left
df.insert(loc=0, column='Creation_DT', value=pd.to_datetime('today'))
df.insert(loc=1, column='Create_BY', value="Sean")
# pandas DataFrames have no .write method; to_csv writes the file
df.to_csv("/home/path/new/new_file.txt", index=False)
Any ideas or suggestions?
Yes, it is relatively easy to convert this to PySpark code:
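from pyspark.sql import SparkSession, functions as sf
import datetime
# reuse the active session or create one
spark = SparkSession.builder.getOrCreate()
# read in using the DataFrame reader
# if the CSV is stored locally, the path should use file:///
# use hdfs:/// if the CSV is stored in a cluster/HDFS
spdf = (spark.read.format("csv").option("header", "true")
        .load("file:///home/path/Sample Text Files/sample5.txt"))
spdf2 = (
    spdf
    .withColumn("Creation_DT", sf.lit(datetime.date.today().strftime("%Y-%m-%d")))
    .withColumn("Create_BY", sf.lit("Sean"))
    # withColumn appends on the right, so select to move the new columns to the left
    .select("Creation_DT", "Create_BY", *spdf.columns)
)
# Spark writes a directory of part files at this path, not a single file
spdf2.write.option("header", "true").csv("file:///home/path/new/new_file.txt")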
This code assumes that Creation_DT and Create_BY take the same value for every row.
I don't see any PySpark in your code, so I'll just use pandas this way:
cols = df.columns
df['Creation_DT'] = pd.to_datetime('today')
df['Create_BY'] = "Sean"
cols = cols.insert(0, 'Create_BY')
cols = cols.insert(0, 'Creation_DT')
# reorder by selecting; assigning to df.columns would only rename the columns
df = df[cols]
df.to_csv("/home/path/new/new_file.txt", index=False)
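As a quick check of the pandas approach, the sketch below runs it end to end on the two-line sample from the question, held in memory instead of read from a file. It uses the column name Created_By from the desired output (the code in the thread spells it Create_BY), and the date format is an assumption.

```python
import io
import pandas as pd

# The two-line sample file from the question, held in memory.
sample = io.StringIO("ID,Name,Age\n1233,James,15\n")
df = pd.read_csv(sample)

# Add the new columns, then reorder so they come first.
df["Creation_DT"] = pd.Timestamp.today().strftime("%Y-%m-%d")
df["Created_By"] = "Sean"
new_cols = ["Creation_DT", "Created_By"]
df = df[new_cols + [c for c in df.columns if c not in new_cols]]

# Print the result as CSV text; pass a path to to_csv to write a file instead.
print(df.to_csv(index=False))
```

Selecting the columns in the wanted order is what moves the new columns to the left; the values in each row stay attached to their original column names.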
How can I convert the output I get from a PrettyTable to a pandas DataFrame and save it as an Excel file?
My code, which produces the PrettyTable output:
import difflib
from prettytable import PrettyTable
prtab = PrettyTable()
prtab.field_names = ['Item_1', 'Item_2']
# Items_1 and Items_2 are sequences of strings defined elsewhere
for item in Items_2:
    prtab.add_row([item, difflib.get_close_matches(item, Items_1)])
print(prtab)
I'm trying to convert this to a pandas DataFrame, but I get an error saying "DataFrame constructor not properly called!". My code for the conversion is shown below:
AA = pd.DataFrame(prtab, columns = ['Item_1', 'Item_2']).reset_index()
I found this method recently:
pretty_table.get_csv_string()
This converts the table to a CSV string, which you can then write to a CSV file.
I use it like this:
tbl_as_csv = pretty_table.get_csv_string().replace('\r','')
# the with block closes the file even if the write fails
with open("output_path.csv", "w") as text_file:
    text_file.write(tbl_as_csv)
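The same CSV string can also be parsed straight into a DataFrame, which is what the question asks for. This is a minimal sketch: `csv_text_to_frame` is a hypothetical helper, and the literal string below is a made-up stand-in for what get_csv_string() would return, so the example runs without prettytable installed.

```python
import io
import pandas as pd


def csv_text_to_frame(csv_text):
    """Parse CSV text (for example from PrettyTable.get_csv_string()) into a DataFrame."""
    return pd.read_csv(io.StringIO(csv_text))


# Stand-in for the string get_csv_string() would return; the rows are made up.
csv_text = "Item_1,Item_2\r\napple,\"['apple', 'ape']\"\r\npear,['pear']\r\n"
df = csv_text_to_frame(csv_text)
print(df)

# With a real table (assuming `prtab` is the PrettyTable from the question):
# df = csv_text_to_frame(prtab.get_csv_string())
# df.to_excel("output.xlsx", index=False)  # hypothetical file name
```

Note that list-valued cells, such as the results of difflib.get_close_matches, come back as plain strings after the round trip through CSV.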
Load the data into a DataFrame first, then export to PrettyTable and Excel:
import io
import difflib
import pandas as pd
import prettytable as pt
data = []
for item in Items_2:
    data.append([item, difflib.get_close_matches(item, Items_1)])
df = pd.DataFrame(data, columns=['Item_1', 'Item_2'])
# Export to prettytable
# https://stackoverflow.com/a/18528589/190597 (Ofer)
# Use io.StringIO with Python3, use io.BytesIO with Python2
output = io.StringIO()
df.to_csv(output)
output.seek(0)
print(pt.from_csv(output))
# Export to Excel file
filename = '/tmp/output.xlsx'
writer = pd.ExcelWriter(filename)
df.to_excel(writer, sheet_name='Sheet1')
writer.close()  # closing the writer saves the file
I am trying to write to an Excel file and then load the result into a DataFrame. However, calculated values come back as N/A in the DataFrame, even though they are displayed correctly when I open the Excel sheet.
If I open and then save the Excel sheet manually after updating it with Python, loading the DataFrame works.
Here is the code:
import datetime
from openpyxl import load_workbook
import pandas as pd
if __name__ == '__main__':
    portfolio_values = getBalances('USD')  # getBalances is defined elsewhere
    wb = load_workbook(filename='Client_Portfolio_Tracker.xlsx')
    clients = wb['Clients']
    clients['F2'] = portfolio_values
    clients['G2'] = datetime.datetime.now().strftime("%m/%d/%y %H:%M")
    wb.save('Client_Portfolio_Tracker.xlsx')
    client_df = pd.read_excel('Client_Portfolio_Tracker.xlsx', sheet_name='Clients')
    print(client_df)
Output of dataframe
Thanks in advance!