I have a CSV merge that has many columns, and I am having trouble formatting the price columns. I need them to follow this format: $1,000.00. Is there a function I can use to achieve this for just two columns (Sales Price and Payment Amount)? Here is my code so far:
df3 = pd.merge(df1, df2, how='left', on=['Org ID', 'Org Name'])
cols = ['Org Name', 'Org Type', 'Chapter', 'Join Date', 'Effective Date', 'Expire Date',
'Transaction Date', 'Product Name', 'Sales Price',
'Invoice Code', 'Payment Amount', 'Add Date']
df3 = df3[cols]
df3 = df3.fillna("-")
out_csv = root_out + "report-merged.csv"
df3.to_csv(out_csv, index=False)
A solution that I thought was going to work, but which raises an error (ValueError: Unknown format code 'f' for object of type 'str'):
df3['Sales Price'] = df3['Sales Price'].map('${:,.2f}'.format)
Based on your error ("Unknown format code 'f' for object of type 'str'"), the columns you are trying to format are being treated as strings, so converting them with .astype(float) first, as in the code below, addresses this.
There is no good way to set this formatting within the to_csv call itself, but you can apply it in an intermediate step:
cols = ['Sales Price', 'Payment Amount']
df3.loc[:, cols] = df3[cols].astype(float).applymap('${:,.2f}'.format)
Then call to_csv. (Note that in pandas 2.1+, DataFrame.applymap is deprecated in favor of the equivalent DataFrame.map.)
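Putting that together as a minimal, self-contained sketch (the data here is made up; `Series.map` is used per column, which also sidesteps the `applymap` deprecation in newer pandas):

```python
import pandas as pd

# Toy stand-in for the merged df3; the price columns arrive as strings,
# which is what triggers the "Unknown format code 'f'" error
df3 = pd.DataFrame({
    "Org Name": ["A", "B"],
    "Sales Price": ["1000", "2500.5"],
    "Payment Amount": ["999.99", "0"],
})

cols = ["Sales Price", "Payment Amount"]
for c in cols:
    # Convert to float first, then format as currency
    df3[c] = df3[c].astype(float).map("${:,.2f}".format)

print(df3["Sales Price"].tolist())  # ['$1,000.00', '$2,500.50']
```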
Related
I have a pandas dataframe (sample):
data = [['ABC', 'John', '123', 'Yes', '2022_Jan'], ['BCD', 'Amy', '456', 'Yes', '2022_Jan'],
        ['ABC', 'Michelle', '123', 'No', '2022_Feb'], ['CDE', 'John', '789', 'No', '2022_Feb'],
        ['ABC', 'Michelle', '012', 'Yes', '2022_Mar'], ['BCD', 'Amy', '123', 'No', '2022_Mar'],
        ['CDE', 'Jill', '789', 'No', '2022_Mar'], ['CDE', 'Jack', '789', 'No', '2022_Mar']]
tmp2 = pd.DataFrame(data, columns = ['Responsibility', 'Name', 'ID', 'Has Error', 'Year_Month'])
tmp3 = tmp2[['Responsibility', 'Name', 'ID', 'Has Error']]
The actual dataframe is a lot larger with more columns, but the above are the only fields I need right now. I already have the following code that generates a year-to-date table that groups by 'Responsibility' and 'Name' and gives me the number & % of unique 'ID's that have errors and don't have errors, and exports the table to a single Excel sheet:
result = pd.pivot_table(tmp3, index =['Responsibility', 'Name'], columns = ['Has Error'], aggfunc=len)
#cleanup
result.fillna(0, inplace=True)
result.columns = [s1 + "_" + str(s2) for (s1,s2) in result.columns.tolist()]
result = result.rename(columns={'ID_No': 'Does NOT Have Error (count)', 'ID_Yes': 'Has Error (count)'})
result = result.astype(int)
#create fields for %s and totals
result['Has Error (%)'] = round(result['Has Error (count)'] / (result['Has Error (count)'] + result['Does NOT Have Error (count)']) *100, 2).astype(str)+'%'
result['Does NOT Have Error (%)'] = round(result['Does NOT Have Error (count)'] / (result['Has Error (count)'] + result['Does NOT Have Error (count)']) *100, 2).astype(str) + '%'
result['Total Count'] = result['Has Error (count)'] + result['Does NOT Have Error (count)']
result = result.reindex(columns=['Has Error (%)', 'Does NOT Have Error (%)', 'Has Error (count)', 'Does NOT Have Error (count)', 'Total Count'])
#save to excel
Excelwriter = pd.ExcelWriter('./output/final.xlsx', engine='xlsxwriter')
workbook = Excelwriter.book
result.to_excel(Excelwriter, sheet_name='YTD Summary', startrow=0, startcol=0)
Now, I want to keep this YTD summary sheet and also generate the 'result' table for each month's data (from the 'Year_Month' field in the original dataset tmp2), exporting each month's table to its own sheet within the same output file. I will be generating this output file on a recurring basis, so I want the code to automatically identify each month available in a newly read dataframe, generate a separate table per month using the code I've already written above, and export each table to its own tab in the Excel file. I'm a beginner at Python, and this is harder than I originally thought; what I've tried so far is not working. I know one way to do this would be a for loop or matrix functions, but I can't figure out how to make the code work. Any help would be greatly appreciated!
Assuming you don't care about the year, you can split the month from the last column and then iterate over groupby.
Split the month:
df['month'] = df['Year_Month'].str.split('_').str[1]
Iterate with groupby:
for month, df_month in df.groupby('month'):
    # your processing stuff here
    # 'df_month' is the sub-dataframe for one month
    df_month_processed.to_excel(ExcelWriter, sheet_name=month, ...)
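A fuller sketch of that loop, using the sample data from the question; the per-month processing is reduced to a unique-ID count standing in for the pivot-table code above, and the `to_excel` call is shown as a comment so the sketch runs without an Excel engine installed:

```python
import pandas as pd

data = [['ABC', 'John', '123', 'Yes', '2022_Jan'], ['BCD', 'Amy', '456', 'Yes', '2022_Jan'],
        ['ABC', 'Michelle', '123', 'No', '2022_Feb'], ['CDE', 'John', '789', 'No', '2022_Feb'],
        ['ABC', 'Michelle', '012', 'Yes', '2022_Mar'], ['BCD', 'Amy', '123', 'No', '2022_Mar'],
        ['CDE', 'Jill', '789', 'No', '2022_Mar'], ['CDE', 'Jack', '789', 'No', '2022_Mar']]
df = pd.DataFrame(data, columns=['Responsibility', 'Name', 'ID', 'Has Error', 'Year_Month'])

# Split the month off the 'Year_Month' column
df['month'] = df['Year_Month'].str.split('_').str[1]

summaries = {}
for month, df_month in df.groupby('month', sort=False):
    # Stand-in for the pivot/percentage code from the question
    summaries[month] = df_month.groupby(['Responsibility', 'Name'])['ID'].nunique()
    # In the real script, inside `with pd.ExcelWriter('final.xlsx') as writer:`, do:
    # summaries[month].to_excel(writer, sheet_name=month)

print(list(summaries))  # ['Jan', 'Feb', 'Mar']
```

`sort=False` keeps the months in order of appearance rather than sorting them alphabetically (which would put 'Feb' before 'Jan').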
I have the below code to rename a column
df.rename(columns = {'Long Name 1':'Court'}, inplace = True)
But encounter the below error
KeyError: "['Long Name 1'] not in index"
I am not sure why there is an error; when I print the columns of the df, it exists:
print(df.columns)
Result:
Index(['Activity', 'Date', 'Hirer Category', 'No of Slots', 'Slot Status', 'Start Time', 'Court', 'Long Name 1'], dtype='object')
Why am I not able to rename column 'Long Name 1'?
I can't reproduce your problem. I checked with dummy values and did not hit any error; your code works fine as written. You can see the screenshot, linked rather than embedded, as I don't have enough reputation.
Hope this helps. Thank you.
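For what it's worth, the posted snippet alone shouldn't raise that error: `DataFrame.rename` silently ignores keys that don't match any column unless you pass `errors='raise'`, so the traceback likely came from a later indexing line. One common cause of "not in index" for a column that appears to exist is hidden whitespace in the real column name; this is an assumption, since the original file isn't available, but it is easy to check:

```python
import pandas as pd

# A column name with a trailing space looks identical in a plain print
df = pd.DataFrame(columns=['Activity', 'Long Name 1 '])

print([repr(c) for c in df.columns])  # the trailing space shows up in the repr

# Stripping whitespace from every column name before renaming avoids the issue
df.columns = df.columns.str.strip()
df = df.rename(columns={'Long Name 1': 'Court'})

print(list(df.columns))  # ['Activity', 'Court']
```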
Apologies, I didn't even know how to title/describe the issue I am having, so bear with me. I have the following code:
import pandas as pd
data = {'Invoice Number':[1279581, 1279581,1229422, 1229422, 1229422],
'Project Key':[263736, 263736, 259661, 259661, 259661],
'Project Type': ['Visibility', 'Culture', 'Spend', 'Visibility', 'Culture']}
df= pd.DataFrame(data)
How do I get the output to basically group the Invoice Numbers so that there is only 1 row per Invoice Number and combine the multiple Project Types (per that 1 Invoice) into 1 row?
Code for the desired output is below.
Thanks, much appreciated.
import pandas as pd
data = {'Invoice Number':[1279581,1229422],
'Project Key':[263736, 259661],
'Project Type': ['Visibility_Culture', 'Spend_Visibility_Culture']
}
output = pd.DataFrame(data)
output
>>> (df
.groupby(['Invoice Number', 'Project Key'])['Project Type']
.apply(lambda x: '_'.join(x))
.reset_index()
)
Invoice Number Project Key Project Type
0 1229422 259661 Spend_Visibility_Culture
1 1279581 263736 Visibility_Culture
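As a minor stylistic variant, the lambda can be dropped by passing `'_'.join` directly to `.agg`, and `as_index=False` replaces the `reset_index` step:

```python
import pandas as pd

data = {'Invoice Number': [1279581, 1279581, 1229422, 1229422, 1229422],
        'Project Key': [263736, 263736, 259661, 259661, 259661],
        'Project Type': ['Visibility', 'Culture', 'Spend', 'Visibility', 'Culture']}
df = pd.DataFrame(data)

# '_'.join receives each group's 'Project Type' values in row order
output = (df.groupby(['Invoice Number', 'Project Key'], as_index=False)['Project Type']
            .agg('_'.join))
print(output)
```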
I am trying to sort zip codes into various files, but I keep getting
ValueError: cannot reindex from a duplicate axis
I've read through other documentation on Stack Overflow, but I haven't been able to figure out why it's duplicating an axis.
import csv
import pandas as pd
from pandas import DataFrame as df
fp = '/Users/User/Development/zipcodes/file.csv'
file1 = open(fp, 'rb').read()
df = pd.read_csv(fp, sep=',')
df = df[['VIN', 'Reg Name', 'Reg Address', 'Reg City', 'Reg ST', 'ZIP',
'ZIP', 'Catagory', 'Phone', 'First Name', 'Last Name', 'Reg NFS',
'MGVW', 'Make', 'Veh Model','E Mfr', 'Engine Model', 'CY2010',
'CY2011', 'CY2012', 'CY2013', 'CY2014', 'CY2015', 'Std Cnt',
]]
#reader.head(1)
df.head(1)
zipBlue = [65355, 65350, 65345, 65326, 65335, 64788, 64780, 64777, 64743,
64742, 64739, 64735, 64723, 64722, 64720]
The code also contains zipGreen, zipRed, zipYellow, and zipLightBlue lists, but I did not include them in the example.
def IsInSort():
blue = df[df.ZIP.isin(zipBlue)]
green = df[df.ZIP.isin(zipGreen)]
red = df[df.ZIP.isin(zipRed)]
yellow = df[df.ZIP.isin(zipYellow)]
LightBlue = df[df.ZIP.isin(zipLightBlue)]
def SaveSortedZips():
blue.to_csv('sortedBlue.csv')
green.to_csv('sortedGreen.csv')
red.to_csv('sortedRed.csv')
yellow.to_csv('sortedYellow.csv')
LightBlue.to_csv('SortedLightBlue.csv')
IsInSort()
SaveSortedZips()
The relevant part of the traceback:
   1864     # trying to reindex on an axis with duplicates
   1865     if not self.is_unique and len(indexer):
-> 1866         raise ValueError("cannot reindex from a duplicate axis")
   1867
   1868 def reindex(self, target, method=None, level=None, limit=None):
ValueError: cannot reindex from a duplicate axis
I'm pretty sure your problem is related to your column selection:
df = df[['VIN', 'Reg Name', 'Reg Address', 'Reg City', 'Reg ST', 'ZIP',
'ZIP', 'Catagory', 'Phone', 'First Name', 'Last Name', 'Reg NFS',
'MGVW', 'Make', 'Veh Model','E Mfr', 'Engine Model', 'CY2010',
'CY2011', 'CY2012', 'CY2013', 'CY2014', 'CY2015', 'Std Cnt',
]]
'ZIP' is in there twice. Removing one of them should solve the problem.
The error ValueError: cannot reindex from a duplicate axis is one of those very cryptic pandas errors that simply does not tell you what the problem is.
It is often caused by two columns having the same name, either before the operation or internally within it.
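You can confirm and fix this programmatically: `columns.duplicated()` flags the second and later occurrences of a repeated label, and a boolean mask drops them. A small sketch reproducing the situation:

```python
import pandas as pd

df = pd.DataFrame({'VIN': ['x'], 'ZIP': [65355], 'Phone': ['555']})

# Selecting 'ZIP' twice, as in the question, yields duplicated column labels
df = df[['VIN', 'ZIP', 'ZIP', 'Phone']]
print(df.columns.duplicated())

# Keep only the first occurrence of each label
df = df.loc[:, ~df.columns.duplicated()]
print(list(df.columns))  # ['VIN', 'ZIP', 'Phone']
```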
I am trying to normalize column names to a single canonical term automatically when loading a dataframe. The following code works:
import pandas as pd
df=pd.read_csv('Test.csv', encoding = "ISO-8859-1", index_col=0)
firstCol=['FirstName','First Name','Nombre','NameFirst', 'Name', 'Given name', 'given name', 'Name']
df.rename(columns={typo: 'First_Name' for typo in firstCol}, inplace=True)
addressCol=['Residence','Primary Address', 'primary address' ]
df.rename(columns={typo: 'Address' for typo in addressCol}, inplace=True)
computerCol=['Laptop','Desktop', 'server', 'mobile' ]
df.rename(columns={typo: 'Computer' for typo in computerCol}, inplace=True)
Is there a more efficient way of looping or rewriting it so it is less redundant?
The only way I can think of is to reduce it to a single df.rename call by building the complete dictionary once, e.g.:
replacements = {
'Name': ['FirstName','First Name','Nombre','NameFirst', 'Name', 'Given name', 'given name', 'Name'],
'Address': ['Residence','Primary Address', 'primary address' ],
#...
}
df.rename(columns={el: k for k, v in replacements.items() for el in v}, inplace=True)
This should be more efficient in terms of function-call overhead, and I'd personally view it as more readable: a dict whose keys are the "to" values and whose values are the lists of "from" spellings to replace.
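A runnable version of that idea, with `.items()` in place of the Python 2-only `iteritems()` (column names mirror the question; the target names are illustrative):

```python
import pandas as pd

df = pd.DataFrame(columns=['First Name', 'Primary Address', 'Laptop'])

# "to" name -> list of "from" spellings to normalize
replacements = {
    'First_Name': ['FirstName', 'First Name', 'Nombre', 'NameFirst',
                   'Given name', 'given name'],
    'Address': ['Residence', 'Primary Address', 'primary address'],
    'Computer': ['Laptop', 'Desktop', 'server', 'mobile'],
}

# Invert to the {from: to} mapping that rename expects
mapping = {old: new for new, olds in replacements.items() for old in olds}
df = df.rename(columns=mapping)

print(list(df.columns))  # ['First_Name', 'Address', 'Computer']
```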