ValueError: cannot reindex from a duplicate axis using isin with pandas - python

I am trying to sort zip codes into various files, but I keep getting
ValueError: cannot reindex from a duplicate axis
I've read through other answers on Stack Overflow, but I haven't been able to figure out why it's duplicating an axis.
import csv
import pandas as pd
from pandas import DataFrame as df
fp = '/Users/User/Development/zipcodes/file.csv'
file1 = open(fp, 'rb').read()
df = pd.read_csv(fp, sep=',')
df = df[['VIN', 'Reg Name', 'Reg Address', 'Reg City', 'Reg ST', 'ZIP',
'ZIP', 'Catagory', 'Phone', 'First Name', 'Last Name', 'Reg NFS',
'MGVW', 'Make', 'Veh Model','E Mfr', 'Engine Model', 'CY2010',
'CY2011', 'CY2012', 'CY2013', 'CY2014', 'CY2015', 'Std Cnt',
]]
#reader.head(1)
df.head(1)
zipBlue = [65355, 65350, 65345, 65326, 65335, 64788, 64780, 64777, 64743,
64742, 64739, 64735, 64723, 64722, 64720]
The script also defines zipGreen, zipRed, zipYellow and zipLightBlue,
but I have left them out of this example.
def IsInSort():
    blue = df[df.ZIP.isin(zipBlue)]
    green = df[df.ZIP.isin(zipGreen)]
    red = df[df.ZIP.isin(zipRed)]
    yellow = df[df.ZIP.isin(zipYellow)]
    LightBlue = df[df.ZIP.isin(zipLightBlue)]
def SaveSortedZips():
    blue.to_csv('sortedBlue.csv')
    green.to_csv('sortedGreen.csv')
    red.to_csv('sortedRed.csv')
    yellow.to_csv('sortedYellow.csv')
    LightBlue.to_csv('SortedLightBlue.csv')
IsInSort()
SaveSortedZips()
   1864         # trying to reindex on an axis with duplicates
   1865         if not self.is_unique and len(indexer):
-> 1866             raise ValueError("cannot reindex from a duplicate axis")
   1867
   1868     def reindex(self, target, method=None, level=None, limit=None):
ValueError: cannot reindex from a duplicate axis

I'm pretty sure your problem is related to your column selection:
df = df[['VIN', 'Reg Name', 'Reg Address', 'Reg City', 'Reg ST', 'ZIP',
'ZIP', 'Catagory', 'Phone', 'First Name', 'Last Name', 'Reg NFS',
'MGVW', 'Make', 'Veh Model','E Mfr', 'Engine Model', 'CY2010',
'CY2011', 'CY2012', 'CY2013', 'CY2014', 'CY2015', 'Std Cnt',
]]
'ZIP' is in there twice. Removing one of them should solve the problem.
ValueError: cannot reindex from a duplicate axis is one of the more cryptic pandas errors: the message does not tell you which labels are duplicated.
It is often caused by two columns having the same name, either before the operation or internally during it.
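A minimal sketch of how the duplicate can be spotted and removed before filtering. The tiny frame and the names dup, clean and zip_blue are made up for illustration; only the repeated 'ZIP' entry mirrors the question.

```python
import pandas as pd

# Hypothetical miniature of the question's data.
df = pd.DataFrame({"VIN": ["a", "b", "c"], "ZIP": [65355, 10001, 64788]})
cols = ["VIN", "ZIP", "ZIP"]  # 'ZIP' listed twice, as in the question
zip_blue = [65355, 64788]

# Selecting with a repeated label yields two columns named 'ZIP'.
dup = df[cols]
has_duplicates = dup.columns.duplicated().any()
print(dup.columns[dup.columns.duplicated()].tolist())

# With duplicate labels, dup["ZIP"] is a DataFrame rather than a Series,
# which is what later confuses boolean filtering.
zip_is_frame = isinstance(dup["ZIP"], pd.DataFrame)

# Drop repeated labels while keeping the original order, then filter.
unique_cols = list(dict.fromkeys(cols))
clean = df[unique_cols]
blue = clean[clean["ZIP"].isin(zip_blue)]
print(blue)
```

Checking columns.duplicated() right after the selection is a cheap way to catch this kind of typo before it surfaces as a reindexing error further down.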

Related

Add a calculated column to a pivot table in pandas

Hi, I am trying to add new columns to a multi-indexed pandas pivot table that work like a COUNTIF in Excel, counting cells depending on whether a level of the index contains a specific string. This is the sample data:
df = pd.DataFrame({'City': ['Houston', 'Austin', 'Hoover','Adak','Denver','Houston','Adak','Denver'],
'State': ['Texas', 'Texas', 'Alabama','Alaska','Colorado','Texas','Alaska','Colorado'],
'Name':['Aria', 'Penelope', 'Niko','Susan','Aria','Niko','Aria','Niko'],
'Unit':['Sales', 'Marketing', 'Operations','Sales','Operations','Operations','Sales','Operations'],
'Assigned':['Yes','No','Maybe','No','Yes','Yes','Yes','Yes']},
columns=['City', 'State', 'Name', 'Unit','Assigned'])
pivot=df.pivot_table(index=['City','State'],columns=['Name','Unit'],values=['Assigned'],aggfunc=lambda x:', '.join(set(x)),fill_value='')
and this is the desired output (in screenshot). Thanks in advance!
Try:
temp = pivot[('Mango', 'Aria', 'Sales')].str.len()>0
pivot['new col'] = temp.astype(int)
the result:
Based on your edit:
import numpy as np
temp = pivot.xs('Sales', level=2, drop_level=False, axis = 1).apply(lambda x: np.sum([1 if y!='' else 0 for y in x]), axis = 1)
pivot[('', 'total sales', 'count how many...')]=temp
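The same count can be sketched without apply, by comparing the selected block against the empty string. This is only an illustration built on the question's sample data; the names sales and counts and the column label ('', 'total sales', 'count') are assumptions, and sorted() is added to the aggregation just to make the joined strings deterministic.

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["Houston", "Austin", "Hoover", "Adak", "Denver", "Houston", "Adak", "Denver"],
    "State": ["Texas", "Texas", "Alabama", "Alaska", "Colorado", "Texas", "Alaska", "Colorado"],
    "Name": ["Aria", "Penelope", "Niko", "Susan", "Aria", "Niko", "Aria", "Niko"],
    "Unit": ["Sales", "Marketing", "Operations", "Sales", "Operations", "Operations", "Sales", "Operations"],
    "Assigned": ["Yes", "No", "Maybe", "No", "Yes", "Yes", "Yes", "Yes"],
})

pivot = df.pivot_table(
    index=["City", "State"],
    columns=["Name", "Unit"],
    values=["Assigned"],
    aggfunc=lambda x: ", ".join(sorted(set(x))),
    fill_value="",
)

# Column levels are (values, Name, Unit), so level 2 is the Unit level.
sales = pivot.xs("Sales", level=2, axis=1, drop_level=False)

# Count the non-empty 'Sales' cells in each row.
counts = (sales.fillna("") != "").sum(axis=1)

# The new column key needs one entry per column level.
pivot[("", "total sales", "count")] = counts
print(counts)
```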

Getting KeyError while merging dataframe even though columns are correct

I am trying to merge two dataframes but keep getting a KeyError.
I checked the column names I used and they look fine. I even trimmed the column names so that there are no leading or trailing spaces, but I still get the same error.
I have no clue why it is failing.
Can someone help me with this? I went through a lot of posts here and on other sites, but none of them fixes my issue.
This is the merge statement:
roaster_ilc_mrg = (pd.merge(roaster,ilcdata,left_on="Emp ID", right_on="Emp Serial"))
Here are the columns of both dataframes:
roaster Columns: Index(['Emp ID', 'XID', 'Name', 'Team', 'Location', 'Site', 'Status'], dtype='object')
ilcdata Columns: Index(['Activity Code', 'Billing Code', 'Emp Serial', 'Emp Lastname',
'Emp Manager', 'Weekending Date', 'Total hours'],
dtype='object')
Error I see:
File "C:\Abraham\Python\Anaconda3\lib\site-packages\pandas\core\generic.py", line 1563, in _get_label_or_level_values
raise KeyError(key)
KeyError: 'Emp Serial'
ilcdata.head() data
If I run the check below for any column label in the ilcdata dataframe, I get the default value, i.e. 'No col':
print(ilcdata.get('Emp Serial', default='No col'))
But all of those labels are present. It's driving me crazy, because I have used a similar merge before and it worked smoothly.
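One way to investigate a label that prints correctly but cannot be found is to look at the repr of each column name, which exposes characters that an ordinary print hides. The sketch below is not a confirmed diagnosis of this question: it assumes, purely for illustration, that a label contains a non-breaking space, and normalize_label is a hypothetical helper.

```python
import pandas as pd

# Hypothetical data: the right-hand label contains a non-breaking space (\u00a0).
roaster = pd.DataFrame({"Emp ID": [1, 2], "Name": ["A", "B"]})
ilcdata = pd.DataFrame({"Emp\u00a0Serial": [1, 2], "Total hours": [40, 38]})

# repr() shows escape sequences such as '\xa0' or '\ufeff'.
print([repr(c) for c in ilcdata.columns])

def normalize_label(label):
    """Remove a byte-order mark and collapse every whitespace run to one space."""
    return " ".join(str(label).replace("\ufeff", "").split())

ilcdata.columns = [normalize_label(c) for c in ilcdata.columns]

merged = pd.merge(roaster, ilcdata, left_on="Emp ID", right_on="Emp Serial")
print(merged)
```

A plain strip() would not help in this assumed case, because the odd character sits in the middle of the label rather than at either end.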

KeyError: 'Passing list-likes to .loc or [] with any missing labels is no longer supported

I have a data frame with the following columns:
job_post.columns
Index(['Job.ID_list', 'Provider', 'Status', 'Slug', 'Title', 'Position',
'Company', 'City', 'State.Name', 'State.Code', 'Address', 'Latitude',
'Longitude', 'Industry', 'Job.Description', 'Requirements', 'Salary',
'Listing.Start', 'Listing.End', 'Employment.Type', 'Education.Required',
'Created.At', 'Updated.At', 'Job.ID_desc', 'text'],
dtype='object')
I want to select only the following columns from the dataframe:
columns_job_post = ['Job.ID_listing', 'Slug', 'Position', 'Company', 'Industry', 'Job.Description','Employment.Type', 'Education.Required', 'text'] # columns to keep
However, I get the result:
KeyError: 'Passing list-likes to .loc or [] with any missing labels is no longer supported
I solved the issue by writing:
jobs_final = job_post.reindex(columns = columns_job_post)
Similarly, I have a data frame with the following columns:
cand_exp.columns
Index(['Applicant.ID', 'Position.Name', 'Employer.Name', 'City', 'State.Name',
'State.Code', 'Start.Date', 'End.Date', 'Job.Description', 'Salary',
'Can.Contact.Employer', 'Created.At', 'Updated.At'],
dtype='object')
I also selected just some columns from the whole list using .loc, but this time I didn't get the KeyError: Passing list-likes...
columns_cand_exp = ['Applicant.ID', 'Position.Name', 'Employer.Name', 'Job.Description', 'Salary'] # columns to keep
resumes_final = cand_exp.loc[:, columns_cand_exp]
What is the reason for this?
Thank you in advance!
Because in the first example you asked for a column name that does not exist in the original data frame (Job.ID_listing; the actual column is Job.ID_list).
In the second example, all the requested columns were already in the original data frame.
As the error says: 'Passing list-likes to .loc or [] with any missing labels is no longer supported'.
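The difference can be reproduced on a small frame. This is a sketch with made-up rows; wanted, missing and present are illustrative names. Note that reindex does not fix the typo: it silently creates the missing column and fills it with NaN.

```python
import pandas as pd

# Hypothetical miniature of job_post.
job_post = pd.DataFrame({"Job.ID_list": [1, 2], "Slug": ["a", "b"], "Position": ["x", "y"]})
wanted = ["Job.ID_listing", "Slug", "Position"]  # first label is misspelled

# List the requested labels that the frame does not have.
missing = pd.Index(wanted).difference(job_post.columns).tolist()
print(missing)

# .loc refuses a list that contains a missing label.
try:
    job_post.loc[:, wanted]
    loc_raised = False
except KeyError:
    loc_raised = True

# reindex accepts it, but the unknown column comes back as all NaN.
reindexed = job_post.reindex(columns=wanted)

# Selecting only the labels that exist avoids both the error and the NaN column.
present = [c for c in wanted if c in job_post.columns]
selected = job_post.loc[:, present]
```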

Formatting two columns with dollar signs during to_csv using pandas

I have a merged CSV with many columns, and I am having trouble formatting the price columns. I need them to follow the format $1,000.00. Is there a function I can use to do this for just two columns (Sales Price and Payment Amount)? Here is my code so far:
df3 = pd.merge(df1, df2, how='left', on=['Org ID', 'Org Name'])
cols = ['Org Name', 'Org Type', 'Chapter', 'Join Date', 'Effective Date', 'Expire Date',
'Transaction Date', 'Product Name', 'Sales Price',
'Invoice Code', 'Payment Amount', 'Add Date']
df3 = df3[cols]
df3 = df3.fillna("-")
out_csv = root_out + "report-merged.csv"
df3.to_csv(out_csv, index=False)
I thought the following would work, but it raises an error (ValueError: Unknown format code 'f' for object of type 'str'):
df3['Sales Price'] = df3['Sales Price'].map('${:,.2f}'.format)
Based on your error ("Unknown format code 'f' for object of type 'str'"), the columns you are trying to format are being treated as strings, so the code below converts them with .astype(float) first.
There is no good way to apply this formatting within the to_csv call itself. However, you could add an intermediate line:
cols = ['Sales Price', 'Payment Amount']
df3.loc[:, cols] = df3[cols].astype(float).applymap('${:,.2f}'.format)
Then call to_csv.
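Because the question fills missing values with "-" before formatting, a plain float conversion would fail on those cells. A hedged variant, assuming the placeholder should be kept as-is, is to coerce with pd.to_numeric and format only the values that parse. The sample rows and the helper name to_dollars are hypothetical.

```python
import pandas as pd

# Hypothetical rows mixing strings, numbers, a missing value and a "-" placeholder.
df3 = pd.DataFrame({
    "Org Name": ["A", "B", "C"],
    "Sales Price": ["1000", 2500.5, None],
    "Payment Amount": [1000, "99.9", "-"],
})

def to_dollars(column, placeholder="-"):
    """Format parseable values as $1,000.00 and replace the rest with a placeholder."""
    numbers = pd.to_numeric(column, errors="coerce")  # unparseable values become NaN
    return numbers.map(lambda x: placeholder if pd.isna(x) else f"${x:,.2f}")

for col in ["Sales Price", "Payment Amount"]:
    df3[col] = to_dollars(df3[col])

print(df3)
```

After this step the two columns hold strings, so the frame can be written with to_csv(out_csv, index=False) as in the question.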

Pandas efficiently normalize column titles in a dataframe

I am trying to normalize column titles automatically when loading a dataframe, so that every variant of a title maps to a single term. The following code works:
import pandas as pd
df=pd.read_csv('Test.csv', encoding = "ISO-8859-1", index_col=0)
firstCol=['FirstName','First Name','Nombre','NameFirst', 'Name', 'Given name', 'given name', 'Name']
df.rename(columns={typo: 'First_Name' for typo in firstCol}, inplace=True)
addressCol=['Residence','Primary Address', 'primary address' ]
df.rename(columns={typo: 'Address' for typo in addressCol}, inplace=True)
computerCol=['Laptop','Desktop', 'server', 'mobile' ]
df.rename(columns={typo: 'Address' for typo in computerCol}, inplace=True)
Is there a more efficient way of looping or rewriting it so it is less redundant?
The only way I can think of is to reduce it to a single df.rename call by building the complete dictionary once, e.g.:
replacements = {
'Name': ['FirstName','First Name','Nombre','NameFirst', 'Name', 'Given name', 'given name', 'Name'],
'Address': ['Residence','Primary Address', 'primary address' ],
#...
}
df.rename(columns={el:k for k,v in replacements.items() for el in v}, inplace=True)  # .items() on Python 3; .iteritems() was Python 2 only
This should reduce function call overhead slightly, but I'd personally value it more for readability: the dict keys are the "to" names and the values are the lists of "from" names to replace.
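A runnable sketch of the same idea on Python 3, using an empty frame with made-up column titles. The target name First_Name follows the question; rename_map is an illustrative name.

```python
import pandas as pd

# Target name -> list of variants that should be renamed to it.
replacements = {
    "First_Name": ["FirstName", "First Name", "Nombre", "NameFirst", "Name", "Given name", "given name"],
    "Address": ["Residence", "Primary Address", "primary address"],
}

# Invert the mapping once: each variant points at its target name.
rename_map = {
    variant: target
    for target, variants in replacements.items()
    for variant in variants
}

# Hypothetical frame; only the column titles matter here.
df = pd.DataFrame(columns=["Nombre", "Residence", "Age"])
df = df.rename(columns=rename_map)
print(list(df.columns))
```

Titles that are not in the mapping, such as Age here, pass through unchanged, because rename ignores keys that do not match any column.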
