Programming Data to CSV using Pandas - python

I am trying to write a CSV file and an Excel file. I followed an online aid, but it does not appear to work and raises KeyError: 'Teflon'. Any thoughts why?
Here is the aid I was following: Aid
import pandas as pd
import os
def sort_data_frame_by_Teflon_column(dataframe):
    dataframe = dataframe.sort_values(by= ['Teflon'])
def sort_data_frame_by_LacticAcid_column(dataframe):
    dataframe = dataframe.sort_values(by= ['Lactic Acid'])
def sort_data_frame_by_ExperimentalWeight_column(dataframe):
    dataframe = dataframe.sort_values(by= ['Experimental Weight'])
def sort_data_frame_by_Ratio_column(dataframe):
    dataframe = dataframe.sort_values(by= ['Ratio of Lactic Acid to Experimental Weight'])
def get_data_in_Teflon(dataframe):
    dataframe = dataframe.loc[dataframe['Teflon']]
    dataframe = dataframe.sort_values(by=['Teflon'])
def get_data_in_LacticAcid(dataframe):
    dataframe = dataframe.loc[dataframe['Lactic Acid']]
    dataframe = dataframe.sort_values(by= ['Lactic Acid'])
def get_data_in_ExperimentalWeight(dataframe):
    dataframe = dataframe.loc[dataframe['Experimental Weight']]
    dataframe = dataframe.sort_values(by= ['Experimental Weight'])
def get_data_in_Ratio(dataframe):
    dataframe = dataframe.loc[dataframe['Ratio of Lactic Acid to Experimental Weight']]
    dataframe = dataframe.sort_values(by= ['Ratio of Lactic Acid to Experimental Weight'])
path = 'C:\\Users\\Light_Wisdom\\Documents\\Spyder\\Mass-TeflonLacticAcidRatio.csv'
#output_file = open(path,'x')
#text = input("Input Data: ")
#text.replace('\\n', '\n')
#output_file.write(text. replace('\\', ''))
#output_file.close()
csv_file = 'C:\\Users\\Light_Wisdom\\Documents\\Spyder\\Mass-TeflonLacticAcidRatio.csv'
dataframe = pd.read_csv(csv_file)
dataframe = dataframe.set_index('Teflon')
sort_data_frame_by_Teflon_column(dataframe)
sort_data_frame_by_LacticAcid_column(dataframe)
sort_data_frame_by_ExperimentalWeight_column(dataframe)
sort_data_frame_by_Ratio_column(dataframe)
get_data_in_Teflon(dataframe)
get_data_in_LacticAcid(dataframe)
get_data_in_ExperimentalWeight(dataframe)
get_data_in_Ratio(dataframe)
write_to_csv_file_by_pandas("C:\\images\\Trial1.csv", dataframe)
write_to_excel_file_by_pandas("C:\\images\\Trial1.xlsx", dataframe)
#data_frame.to_csv(csv_file_path)
#excel_writer = pd.ExcelWriter(excel_file_path, engine = 'xlsxwriter')
#excel_writer.save()
Here is the CSV:
Teflon,Lactic Acid,Experimental Weight,Ratio of Lactic Acid to Experimental Weight
1.973,.2201,1.56,.14
2.05,.15,.93,.16
1.76,.44,1.56,.28
Edit New Question 7/24/19
I am trying to automate the answer with functions, and I got an error while attempting it.
def get_Data():
    check = 'No'
    while(check == 'Yes'):
        row_name = input("What is the row number? ")
        row_name = []
        data = float(input("Teflon, Lactic_Acid, Expt_Wt, LacticAcid_to_Expt1_Wt: "))
        dataframe = []
        check = input("Add another row? ")
    return row_name,data, dataframe
def row_inputter(row_name,data,dataframe):
    row_name.append(data)
    dataframe.append(row_name)
    return row_name, dataframe
# Define your data
#row1 = [ 1.973, .2201, 1.56, .14]
#row2 = [2.05, .15, .93, .16]
#row3 = [1.76, .44, 1.56, .28]
row_name,data, dataframe = get_Data()
row, df = row_inputter()

I can tell that you are a pandas beginner. No worries. Here is how to do the first few operations.
The aid that you reference does things the old-fashioned way and does not leverage the many fine tools already available for moving CSV and XLSX data in and out of pandas and Python.
XlsxWriter is a fabulous library that makes it easy to write pandas data to XLSX files. It only writes files; it cannot read existing ones.
XlsxWriter documentation: https://xlsxwriter.readthedocs.io/working_with_pandas.html
# Do necessary imports
import pandas as pd
import os
import xlsxwriter
# Define your data
expt_data = ["Teflon", "Lactic_Acid", "Expt_Wt", "LacticAcid_to_Exptl_Wt"]
row1 = [ 1.973, .2201, 1.56, .14]
row2 = [2.05, .15, .93, .16]
row3 = [1.76, .44, 1.56, .28]
# Create dataframe using constructor method
df1 = pd.DataFrame([row1, row2, row3], columns=expt_data)
# Output dataframe
df1
# Sort dataframe by Teflon column values and output it
Teflon_Sorted = df1.sort_values(by=["Teflon"])
Teflon_Sorted
# Sort dataframe by Lactic_Acid column values and output it
Lactic_Acid_Sorted = df1.sort_values(by=["Lactic_Acid"])
Lactic_Acid_Sorted
# Sort dataframe by Expt_Wt column values and output it
Expt_Wt_sorted = df1.sort_values(by=["Expt_Wt"])
Expt_Wt_sorted
# Sort dataframe by LacticAcid_to_Exptl_Wt column values and output it
LacticAcid_to_Exptl_Wt_sorted = df1.sort_values(by=["LacticAcid_to_Exptl_Wt"])
LacticAcid_to_Exptl_Wt_sorted
# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter("Trial1.xlsx", engine='xlsxwriter')
# Convert all dataframes to XlsxWriter Excel objects and then write each to a different worksheet in the workbook created above named "Trial1.xlsx".
Teflon_Sorted.to_excel(writer, sheet_name='Teflon_Sorted')
Lactic_Acid_Sorted.to_excel(writer, sheet_name='Lactic_Acid_Sorted')
Expt_Wt_sorted.to_excel(writer, sheet_name='Expt_Wt_sorted')
LacticAcid_to_Exptl_Wt_sorted.to_excel(writer, sheet_name='LacticAcid_to_Exptl_Wt_sorted')
# Close the Pandas Excel writer and output the Excel file.
# (writer.save() was removed in pandas 2.0; close() also saves the file.)
writer.close()
# now go to your current directory in your file system where the Jupyter Notebook or Python file is executing and find your file.
# Type !dir in Jupyter cell to list current directory on MS-Windows
!dir
Sorry, this is not a complete application that does everything you want, but I have limited time. I showed you how to write out your final results. I left reading in your data file, rather than creating the data inline in your Python program, as a learning exercise for you.
My recommendation is to use XlsxWriter for writing Excel files from pandas. Follow the fabulous tutorial on the XlsxWriter website. XlsxWriter is probably the best and easiest Python-pandas-Excel toolkit right now. It does programmatically everything that someone would normally have to do manually ("interactively").

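You already set Teflon as the index with
dataframe = dataframe.set_index('Teflon')
so your dataframe no longer contains that column. Any later lookup of it as a column, such as dataframe['Teflon'] in get_data_in_Teflon(), throws that error. Your function
sort_data_frame_by_Teflon_column()
would also fail with it on older pandas versions (since pandas 0.23, sort_values also accepts index level names).
Also, the other functions like:
def get_data_in_LacticAcid(dataframe):
    dataframe = dataframe.loc[dataframe['Lactic Acid']]
    dataframe = dataframe.sort_values(by= ['Lactic Acid'])
will likely fail or turn your dataframe into an empty one because of the first line, which uses the column's values as row labels. Note also that these functions only rebind a local name and return nothing, so the caller's dataframe is never changed. What exactly are you trying to achieve with those functions?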
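A minimal sketch of the point above, assuming the column names from the CSV in the question. The helper name sort_by is made up for illustration. It shows that the column lookup fails once Teflon has been moved into the index, and that keeping it as a regular column and returning the sorted frame avoids the problem.

```python
import io
import pandas as pd

# The CSV from the question, inlined so the sketch is self-contained.
CSV_TEXT = """Teflon,Lactic Acid,Experimental Weight,Ratio of Lactic Acid to Experimental Weight
1.973,.2201,1.56,.14
2.05,.15,.93,.16
1.76,.44,1.56,.28
"""


def sort_by(dataframe, column):
    """Return a sorted copy; sort_values does not modify the frame in place."""
    return dataframe.sort_values(by=[column])


df = pd.read_csv(io.StringIO(CSV_TEXT))

# After set_index, 'Teflon' lives in the index, not in the columns.
indexed = df.set_index("Teflon")
try:
    indexed["Teflon"]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# Keeping 'Teflon' as a regular column avoids the error.
teflon_sorted = sort_by(df, "Teflon")

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_out = teflon_sorted.to_csv(index=False)
print(lookup_failed)
print(csv_out)
```

To write a file instead, pass a path, for example teflon_sorted.to_csv("Trial1.csv", index=False). If the index is really wanted, indexed.reset_index() turns it back into a column.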

Related

Writing a Python Created Pivot Table using Pandas to a Excel Document

I have been working on automating a series of reports in Python. I have been trying to create a series of pivot tables from an imported CSV (binlift.csv). I have found the pandas library very useful for this; however, I can't seem to find anything that helps me write the pandas-created pivot tables to my Excel document (Template.xlsx), and I was wondering if anyone can help. So far I have written the following code:
import openpyxl
import csv
from datetime import datetime
import datetime
import pandas as pd
import numpy as np
file1 = "Template.xlsx" # template file
file2 = "binlift.csv" # raw data csv
wb1 = openpyxl.load_workbook(file1) # opens template
ws1 = wb1.create_sheet("Raw Data") # create a new sheet in template called Raw Data
summary = wb1.worksheets[0] # variables given to sheets for manipulation
rawdata = wb1.worksheets[1]
headings = ["READER","BEATID","LIFTYEAR","LIFTMONTH","LIFTWEEK","LIFTDAY","TAGGED","UNTAGGEDLIFT","LIFT"]
df = pd.read_csv(file2, names=headings)
pivot_1 = pd.pivot_table(df, index=["LIFTYEAR", "LIFTMONTH","LIFTWEEK"], values=["TAGGED","UNTAGGEDLIFT","LIFT"],aggfunc=np.sum)
pivot_2 = pd.pivot_table(df, index=["LIFTYEAR", "LIFTMONTH"], values=["TAGGED","UNTAGGEDLIFT"],aggfunc=np.sum)
pivot_3 = pd.pivot_table(df, index=["READER"], values=["TAGGED","UNTAGGEDLIFT","LIFT"],aggfunc=np.sum)
print(pivot_1)
print(pivot_2)
print(pivot_3)
wb1.save('test.xlsx')
pandas has an option to write 'xlsx' files.
Here we get all the unique index values at level 0 of the pivot table, then go over them one by one to subset the table and write that part of the table to its own sheet.
writer = pd.ExcelWriter('output.xlsx')
for key in pivot_1.index.get_level_values(0).unique():
    temp_df = pivot_1.xs(key, level=0)
    # Sheet names must be strings, and LIFTYEAR values may be numeric
    temp_df.to_excel(writer, sheet_name=str(key))
writer.close()  # writer.save() was removed in pandas 2.0
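The loop above can be sketched as two small functions. This is only an illustration under stated assumptions: the rows are invented stand-ins for binlift.csv, and the function names split_by_first_level and write_sheets are hypothetical. The Excel step is kept in its own function because it needs an Excel engine such as openpyxl or XlsxWriter to be installed.

```python
import pandas as pd


def split_by_first_level(pivot):
    """Map each first-level index value (as a string) to its sub-table."""
    keys = pivot.index.get_level_values(0).unique()
    return {str(key): pivot.xs(key, level=0) for key in keys}


def write_sheets(frames, path):
    """Write each frame to its own worksheet; the context manager saves on exit."""
    with pd.ExcelWriter(path) as writer:
        for sheet_name, frame in frames.items():
            frame.to_excel(writer, sheet_name=sheet_name)


# Invented rows standing in for binlift.csv.
df = pd.DataFrame({
    "LIFTYEAR": [2018, 2018, 2019, 2019],
    "LIFTMONTH": [11, 12, 1, 1],
    "TAGGED": [5, 7, 3, 4],
    "LIFT": [6, 9, 3, 6],
})
pivot = pd.pivot_table(df, index=["LIFTYEAR", "LIFTMONTH"],
                       values=["TAGGED", "LIFT"], aggfunc="sum")

frames = split_by_first_level(pivot)
for name, frame in frames.items():
    print(name)
    print(frame)
```

Calling write_sheets(frames, "output.xlsx") would then produce one sheet per year. Excel limits sheet names to 31 characters, so long keys may need shortening first.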

Python - How to create a pandas Dataframe directly from Smartsheets?

I don't understand how to import a Smartsheet sheet and convert it to a pandas dataframe. I want to manipulate the data from Smartsheet. Currently I go to Smartsheet, export to CSV, and import the CSV in Python, but I want to eliminate this step so that the script can run on a schedule.
import smartsheet
import pandas as pd
access_token ='#################'
smartsheet = Smartsheet(access_token)
sheet = smartsheet.sheets.get('Sheet 1')
pd.DataFrame(sheet)
Here is a simple method to convert a sheet to a dataframe:
def simple_sheet_to_dataframe(sheet):
    col_names = [col.title for col in sheet.columns]
    rows = []
    for row in sheet.rows:
        cells = []
        for cell in row.cells:
            cells.append(cell.value)
        rows.append(cells)
    data_frame = pd.DataFrame(rows, columns=col_names)
    return data_frame
The only issue with creating a dataframe from smartsheets is that for certain column types cell.value and cell.display_value are different. For example, contact columns will either display the name or the email address depending on which is used.
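To make the value versus display_value difference concrete, here is a small sketch that runs without the Smartsheet SDK. The objects built with SimpleNamespace are hypothetical stand-ins that only mimic the attributes used in this thread (columns, rows, cells, column_id, value, display_value); the contact data is invented, and the function name sheet_to_dataframe is made up for illustration.

```python
from types import SimpleNamespace

import pandas as pd


def sheet_to_dataframe(sheet, value_cols=()):
    """Use cell.value for columns named in value_cols, cell.display_value otherwise."""
    titles = {col.id: col.title for col in sheet.columns}
    rows = [
        [cell.value if titles[cell.column_id] in value_cols else cell.display_value
         for cell in row.cells]
        for row in sheet.rows
    ]
    return pd.DataFrame(rows, columns=[col.title for col in sheet.columns])


def make_cell(column_id, value, display_value):
    """Hypothetical stand-in for an SDK cell object."""
    return SimpleNamespace(column_id=column_id, value=value, display_value=display_value)


# Stand-in sheet with one text column and one contact column (invented data).
fake_sheet = SimpleNamespace(
    columns=[SimpleNamespace(id=1, title="Task"),
             SimpleNamespace(id=2, title="Assigned User")],
    rows=[SimpleNamespace(cells=[make_cell(1, "Write report", "Write report"),
                                 make_cell(2, "ann@example.com", "Ann")])],
)

df_display = sheet_to_dataframe(fake_sheet)
df_value = sheet_to_dataframe(fake_sheet, value_cols=("Assigned User",))
print(df_display)
print(df_value)
```

With a real sheet object from the SDK, the same function should apply as long as the cells expose those attributes.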
Here is a snippet of what I use when needing to pull in data from Smartsheet into Pandas. Note, I've included garbage collection as I regularly work with dozens of sheets at or near the 200,000 cell limit.
import smartsheet
import pandas as pd
import gc
configs = {'api_key': '0000000',  # placeholder; the access token is a string
           'value_cols': ['Assigned User']}
class SmartsheetConnector:
    def __init__(self, configs):
        self._cfg = configs
        self.ss = smartsheet.Smartsheet(self._cfg['api_key'])
        self.ss.errors_as_exceptions(True)
    def get_sheet_as_dataframe(self, sheet_id):
        sheet = self.ss.Sheets.get_sheet(sheet_id)
        col_map = {col.id: col.title for col in sheet.columns}
        # rows = sheet id, row id, cell values or display values
        data_frame = pd.DataFrame([[sheet.id, row.id] +
                                   [cell.value if col_map[cell.column_id] in self._cfg['value_cols']
                                    else cell.display_value for cell in row.cells]
                                   for row in sheet.rows],
                                  columns=['Sheet ID', 'Row ID'] +
                                          [col.title for col in sheet.columns])
        del sheet, col_map
        gc.collect()  # force garbage collection
        return data_frame
    def get_report_as_dataframe(self, report_id):
        rprt = self.ss.Reports.get_report(report_id, page_size=0)
        page_count = int(rprt.total_row_count/10000) + 1
        col_map = {col.virtual_id: col.title for col in rprt.columns}
        data = []
        for page in range(1, page_count + 1):
            rprt = self.ss.Reports.get_report(report_id, page_size=10000, page=page)
            data += [[row.sheet_id, row.id] +
                     [cell.value if col_map[cell.virtual_column_id] in self._cfg['value_cols']
                      else cell.display_value for cell in row.cells] for row in rprt.rows]
        del rprt
        data_frame = pd.DataFrame(data, columns=['Sheet ID', 'Row ID']+list(col_map.values()))
        del col_map, page_count, data
        gc.collect()
        return data_frame
This adds additional columns for sheet and row IDs so that I can write back to Smartsheet later if needed.
Sheets cannot be retrieved by name, as you've shown in your example code. It is entirely possible for you to have multiple sheets with the same name. You must retrieve them with their sheetId number.
For example:
sheet = smartsheet_client.Sheets.get_sheet(4583173393803140) # sheet_id
http://smartsheet-platform.github.io/api-docs/#get-sheet
Smartsheet sheets have a lot of properties associated with them. You'll need to go through the rows and columns of your sheet to retrieve the information you're looking for, and construct it in a format your other system can recognize.
The API docs contain a listing of properties and examples. As a minimal example:
for row in sheet.rows:
    for cell in row.cells:
        # Do something with cell.object_value here
        pass
Get the sheet as a csv:
(https://smartsheet-platform.github.io/api-docs/?python#get-sheet-as-excel-pdf-csv)
smartsheet_client.Sheets.get_sheet_as_csv(
1531988831168388, # sheet_id
download_directory_path)
Read the csv into a DataFrame:
(https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)
pandas.read_csv
You can use the smartsheet-dataframe library.
It is very easy to use and allows Sheets or Reports to be delivered as a DataFrame.
pip install smartsheet-dataframe
Get a report as df
from smartsheet_dataframe import get_as_df, get_report_as_df
df = get_report_as_df(token='smartsheet_auth_token',
report_id=report_id_int)
Get a sheet as df
from smartsheet_dataframe import get_as_df, get_sheet_as_df
df = get_sheet_as_df(token='smartsheet_auth_token',
sheet_id=sheet_id_int)
Replace 'smartsheet_auth_token' with your token (numbers and letters).
Replace sheet_id_int or report_id_int with your sheet or report id (numbers only).

Import Excel Tables into pandas dataframe

I would like to import excel tables (made by using the Excel 2007 and above tabulating feature) in a workbook into separate dataframes. Apologies if this has been asked before but from my searches I couldn't find what I wanted. I know you can easily do this using the read_excel function however this requires the specification of a Sheetname or returns a dict of dataframes for each sheet.
Instead of specifying sheetname, I was wondering whether there was a way of specifying tablename or better yet return a dict of dataframes for each table in the workbook.
I know this can be done by combining xlwings with pandas but was wondering whether this was built-into any of the pandas functions already (maybe ExcelFile).
Something like this:-
import pandas as pd
xls = pd.ExcelFile('excel_file_path.xls')
# to read all tables to a map
# (table_names is the attribute I wish existed, not an existing pandas API)
tables_to_df_map = {}
for table_name in xls.table_names:
    tables_to_df_map[table_name] = xls.parse(table_name)
Although not exactly what I was after, I have found a way to get table names with the caveat that it's restricted to sheet name.
Here's an excerpt from the code that I'm currently using:
import pandas as pd
import openpyxl as op
wb=op.load_workbook(file_location)
# Connecting to the specified worksheet
ws = wb[sheetname]
# Initialising an empty list where the excel tables will be imported
# into
var_tables = []
# Importing table details from excel: Table_Name and Sheet_Range
# (recent openpyxl versions expose the tables as the dict-like ws.tables;
# older versions kept a plain list in ws._tables)
for table in ws.tables.values():
    sht_range = ws[table.ref]
    data_rows = []
    i = 0
    j = 0
    for row in sht_range:
        j += 1
        data_cols = []
        for cell in row:
            i += 1
            data_cols.append(cell.value)
            # After the last cell, add a Table_Name header or the table's name
            if (i == len(row)) and (j == 1):
                data_cols.append('Table_Name')
            elif i == len(row):
                data_cols.append(table.name)
        data_rows.append(data_cols)
        i = 0
    var_tables.append(data_rows)
# Creating an empty list where all the dfs will be appended
# into
var_df = []
# Appending each table extracted from excel into the list
for tb in var_tables:
    df = pd.DataFrame(tb[1:], columns=tb[0])
    var_df.append(df)
# Merging all in one big df
df = pd.concat(var_df,axis=1) # This merges on columns
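For the dict of dataframes per table that the question asks for, a sketch along the same lines follows. It assumes a recent openpyxl release where ws.tables is dict-like, and it assumes every table has a single header row. The workbook is built in memory purely for illustration, and the function name tables_to_dataframes is hypothetical.

```python
import pandas as pd
from openpyxl import Workbook
from openpyxl.worksheet.table import Table


def tables_to_dataframes(workbook):
    """Return {table name: DataFrame} for every Excel table in the workbook."""
    frames = {}
    for ws in workbook.worksheets:
        for table in ws.tables.values():
            # ws[table.ref] yields the rows of cells covered by the table range.
            rows = [[cell.value for cell in row] for row in ws[table.ref]]
            # Assumption: the first row of the range is the header row.
            frames[table.name] = pd.DataFrame(rows[1:], columns=rows[0])
    return frames


# Small in-memory workbook with one table, standing in for a real file.
wb = Workbook()
ws = wb.active
for row in [["Name", "Value"], ["a", 1], ["b", 2]]:
    ws.append(row)
ws.add_table(Table(displayName="Table1", ref="A1:B3"))

frames = tables_to_dataframes(wb)
print(frames["Table1"])
```

For a file on disk, replace the in-memory workbook with openpyxl.load_workbook(path). Table names are unique within a workbook, so they are safe to use as dict keys.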

How can compare two excel files for checking the format in python?

I have one Excel sheet in the right format (a certain number of headers with specific names). I have another Excel sheet that I have to check against it: it must have the same number of headers and the same header names, and it does not matter if the values below the headers differ. How can I solve this? Would NLP or some other method be suitable?
If you have to compare two Excel files, you could try something like this (I also add some example Excel files):
def areHeaderExcelEqual(excel1, excel2) :
    equals = True
    if len(excel1.columns) != len(excel2.columns):
        return False
    for i in range(len(excel1.columns)):
        if excel1.columns[i] != excel2.columns[i] :
            equals = False
    return equals
And that's an application:
import pandas as pd
#create first example Excel
df_out = pd.DataFrame([('string1',1),('string2',2), ('string3',3)], columns=['Name', 'Value'])
df_out.to_excel('tmp1.xlsx')
#create second example Excel
df_out = pd.DataFrame([('string5',1),('string2',5), ('string2',3)], columns=['Name', 'Value'])
df_out.to_excel('tmp2.xlsx')
# create third example Excel
df_out = pd.DataFrame([('string1',1),('string4',2), ('string3',3)], columns=['MyName', 'MyValue'])
df_out.to_excel('tmp3.xlsx')
excel1 = pd.read_excel('tmp1.xlsx')
excel2 = pd.read_excel('tmp2.xlsx')
excel3 = pd.read_excel('tmp3.xlsx')
print(areHeaderExcelEqual(excel1, excel2))
print(areHeaderExcelEqual(excel1, excel3))
Note: the example Excel files are created only to show the different outputs. The first call prints True and the second prints False.
The idea is the same for any other files. For more insight, see How to create dataframes.
Here's your code:
f1 = pd.read_excel('file1.xlsx')
f2 = pd.read_excel('file2.xlsx')
print(areHeaderExcelEqual(f1, f2))
You can use pandas for that comparison.
import pandas as pd
f1 = pd.read_excel('sheet1.xlsx')
f2 = pd.read_excel('sheet2.xlsx')
header_threshold = 5 # any number of headers
print(len(f1.columns) == header_threshold)
print(f1.columns) # get the column names as values
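A compact variant of the same check, sketched with in-memory frames so it runs without any Excel files. The helper names same_headers and header_differences are made up for illustration, and the second helper is an addition that reports which names differ.

```python
import pandas as pd


def same_headers(left, right):
    """True when both frames have the same column names in the same order."""
    return list(left.columns) == list(right.columns)


def header_differences(expected, actual):
    """Return (missing, unexpected) column names relative to the expected frame."""
    missing = [c for c in expected.columns if c not in actual.columns]
    unexpected = [c for c in actual.columns if c not in expected.columns]
    return missing, unexpected


# In-memory stand-ins for the template file and two files to check.
template = pd.DataFrame(columns=["Name", "Value"])
good = pd.DataFrame([("string5", 1)], columns=["Name", "Value"])
bad = pd.DataFrame([("string1", 1)], columns=["MyName", "Value"])

print(same_headers(template, good))
print(same_headers(template, bad))
print(header_differences(template, bad))
```

With real files, pd.read_excel(path, nrows=0) reads only the header row, which is enough for this comparison.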

DataFrame Split On Rows and apply on header one column using Python Pandas

I'm working on a project and ran into a messy situation where I have to split a data frame based on its first column. The data frame comes from SQL queries and I do a lot of manipulation on it, which is why I'm not posting the code here.
Target: The data frame I have looks like the screenshot below, and it is available as an xlsx file.
Output: I'm looking for output like the attached file:
I'm not able to work out the logic to get this done on the dataframe itself, as I'm a newbie in Python.
I think you can do this:
df = df.set_index('Placement# Name')
df['Date'] = df['Date'].dt.strftime('%m-%d-%Y')  # %m is the month; %M would be minutes
# sum(level=0) was removed in pandas 2.0; group by the index level instead
df_sub = df[['Delivered Impressions','Clicks','Conversion','Spend']].groupby(level=0).sum()\
    .assign(Date='Subtotal')
df_sub['CTR'] = df_sub['Clicks'] / df_sub['Delivered Impressions']
df_sub['eCPA'] = df_sub['Spend'] / df_sub['Conversion']
df_out = pd.concat([df, df_sub]).set_index('Date',append=True).sort_index(level=0)
startline = 0
writer = pd.ExcelWriter('testxls.xlsx', engine='openpyxl')
for n,g in df_out.groupby(level=0):
    g.to_excel(writer, startrow=startline, index=True)
    startline += len(g)+2
writer.close()  # writer.save() was removed in pandas 2.0
Load the Excel file into a Pandas dataframe, then extract rows based on condition.
dframe = pandas.read_excel("sample.xlsx")
dframe = dframe.loc[dframe["Placement# Name"] == "Needed value"]
Where "needed value" would be the value of one of those rows.
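The split-and-subtotal idea can also be sketched without any Excel output. The rows below are invented and only a subset of the columns from the question is used. Each group gets a Subtotal row appended, and CTR is computed afterwards so that the subtotal rows use the summed values.

```python
import pandas as pd

# Invented sample with a subset of the columns named in the question.
df = pd.DataFrame({
    "Placement# Name": ["A", "A", "B"],
    "Date": pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-01"]),
    "Delivered Impressions": [100, 300, 200],
    "Clicks": [10, 30, 5],
})

pieces = []
for name, group in df.groupby("Placement# Name"):
    # Format dates as text so the 'Subtotal' label can share the column.
    group = group.assign(Date=group["Date"].dt.strftime("%m-%d-%Y"))
    totals = group[["Delivered Impressions", "Clicks"]].sum().to_dict()
    subtotal = pd.DataFrame([{"Placement# Name": name, "Date": "Subtotal", **totals}])
    pieces.append(pd.concat([group, subtotal], ignore_index=True))

report = pd.concat(pieces, ignore_index=True)
# Ratio of sums on the subtotal rows, plain ratio on the detail rows.
report["CTR"] = report["Clicks"] / report["Delivered Impressions"]
print(report)
```

Each element of pieces is one placement's block, so the blocks can also be written one below the other with to_excel and startrow, as in the first answer.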
