How To Solve KeyError: u"None of [Index([..], dtype='object')] are in the [columns]"
First try:
df = pd.read_csv('ABCD.csv', index_col=['A'])
df=df.drop_duplicates(['A'],['B'])
KeyError: Index(['Sample_ID'], dtype='object')
Here I found out that it is impossible to drop the index itself, so I removed the index_col argument instead:
df = pd.read_csv('ABCD.csv')
df=df.drop_duplicates(['A'],['B'],keep = 'first')
TypeError: drop_duplicates() got multiple values for argument 'keep'
When I print type(df) it shows "DataFrame", so what could be the problem?
I thought that would be
df=df.drop_duplicates(['A', 'B'],keep = 'first')
instead of:
df=df.drop_duplicates(['A'],['B'],keep = 'first')
The subset must be a single list of columns, not separate positional arguments. From the docs:
subset : column label or sequence of labels, optional
PS: You can use df.drop_duplicates(['A', 'B'], keep='first', inplace=True); you don't need to assign back to df when passing inplace=True.
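A minimal runnable sketch of the corrected call, using a toy frame in place of ABCD.csv (the values are made up for illustration):

```python
import pandas as pd

# Toy data standing in for ABCD.csv, with the column names from the question
df = pd.DataFrame({'A': [1, 1, 2], 'B': ['x', 'x', 'y'], 'C': [10, 20, 30]})

# subset is one list of column labels; keep='first' retains the first occurrence
deduped = df.drop_duplicates(subset=['A', 'B'], keep='first')
print(deduped)
```

Rows 0 and 1 share the same ('A', 'B') pair, so only the first of them survives.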
I have the following code:
df1 = pd.read_excel(f, sheet_name=0, header=6)
# Drop Columns by position
df1 = df1.drop(df1.columns[[5, 8, 10, 14, 15, 16, 17, 18, 19, 21, 22, 23, 24, 25]], axis=1)
# rename cols
This is where I am struggling: each time I attempt to rename the columns by position, df1 ends up as None (a <class 'NoneType'> when I use print(type(df1))). Note that df1 returns the dataframe as expected after dropping the columns.
I get this with everything I have tried below:
column_indices = [0,1,2,3,4,5,6,7,8,9,10,11]
new_names = ['AWG Item Code','Description','UPC','PK','Size','Regular Case Cost','Unit Scan','AMAP','Case Bill Back','Monday Start Date','Sunday End Date','Net Unit']
old_names = df1.columns[column_indices]
df1 = df1.rename(columns=dict(zip(old_names, new_names)), inplace=True)
And with:
df1 = df1.rename({df1.columns[0]:"AWG Item Code",df1.columns[1]:"Description",df1.columns[2]:"UPC",df1.columns[3]:"PK",df1.columns[4]:"Size",df1.columns[5]:"Regular Case Cost",df1.columns[6]:"Unit Scan",df1.columns[7]:"AMAP",df1.columns[8]:"Case Bill Back",df1.columns[9]:"Monday Start Date",df1.columns[10]:"Sunday End Date",df1.columns[11]:"Net Unit"}, inplace = True)
When I remove the inplace=True essentially setting it to false, it returns the dataframe but without any of the changes I am wanting.
The tricky part is that in this program my column headers will change each time, but the columns the data is in will not. Otherwise I would just use df = df.rename(columns={"a": "newname"}).
A simpler version of your code could be:
df1.columns = new_names
It should work as intended, i.e. renaming columns in the index order.
Otherwise, the actual problem in your own code is the assignment: rename(..., inplace=True) returns None, so df1 = df1.rename(..., inplace=True) sets df1 to None. To correct your code, just change the last two lines to:
old_names = df1.columns[column_indices].tolist()
df1.rename(columns=dict(zip(old_names, new_names)), inplace=True)
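A runnable sketch of those corrected lines on a small toy frame (only three columns here, to keep the example short):

```python
import pandas as pd

# Toy frame standing in for the Excel sheet in the question
df1 = pd.DataFrame([[1, 2, 3]], columns=['a', 'b', 'c'])

column_indices = [0, 1, 2]
new_names = ['AWG Item Code', 'Description', 'UPC']
old_names = df1.columns[column_indices].tolist()

# No assignment back to df1: inplace=True mutates df1 and returns None
df1.rename(columns=dict(zip(old_names, new_names)), inplace=True)
print(df1.columns.tolist())
```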
Have a nice day
I was dumb and missing columns=
df1.rename(columns={
    df1.columns[0]: "AWG Item Code",
    df1.columns[1]: "Description",
    df1.columns[2]: "UPC",
    df1.columns[3]: "PK",
    df1.columns[4]: "Size",
    df1.columns[5]: "Regular Case Cost",
    df1.columns[6]: "Unit Scan",
    df1.columns[7]: "AMAP",
    df1.columns[8]: "Case Bill Back",
    df1.columns[9]: "Monday Start Date",
    df1.columns[10]: "Sunday End Date",
    df1.columns[11]: "Net Unit",
}, inplace=True)
works fine
I am not sure whether this answers your question:
There is a simple way to rename the columns:
If I have a data frame, say df, I can see the column names using the following code:
df.columns.to_list()
which gives me, say, the following column names:
['A', 'B', 'C','D']
And I want to keep the first three columns and rename them as 'E', 'F' and 'G' respectively. The following code gives me the desired outcome:
df = df[['A','B','C']]
df.columns = ['E','F','G']
new outcome:
df.columns.to_list()
output: ['E','F','G']
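Put together as one runnable sketch (with a single made-up row of data so the frame is non-empty):

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4]], columns=['A', 'B', 'C', 'D'])

# Keep the first three columns, then rename them by position
df = df[['A', 'B', 'C']]
df.columns = ['E', 'F', 'G']
print(df.columns.to_list())
```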
I need to group by multiple columns and then get a sum in a new column, with an added if condition. I tried the following code and it worked great when grouping by a single column:
df['new column'] = (
df['value'].where(df['value'] > 0).groupby(df['column1']).transform('sum')
)
However, when I try to group by multiple columns I get an error.
df['new_column'] = (
df['value'].where(df['value'] > 0).groupby(df['column1', 'column2']).transform('sum')
)
Error (abridged traceback):
    return self._engine.get_loc(casted_key)
The above exception was the direct cause of the following exception:
    indexer = self.columns.get_loc(key)
    raise KeyError(key) from err
KeyError: ('column1', 'column2')
Could you please advise how I should change the code to get the same result but grouping by multiple columns?
Thank you
Cause of error
The syntax to select multiple columns, df['column1', 'column2'], is wrong. It should be df[['column1', 'column2']].
Even if you use df[['column1', 'column2']] for groupby, pandas will raise another error complaining that the grouper should be one dimensional. This is because df[['column1', 'column2']] returns a dataframe which is a two dimensional object.
How to fix the error?
Hard way:
Pass each of the grouping columns as one dimensional series to groupby
df['new_column'] = (
df['value']
.where(df['value'] > 0)
.groupby([df['column1'], df['column2']]) # Notice the change
.transform('sum')
)
Easy way:
First assign the masked column values to the target column, then do groupby + transform as you would normally do
df['new_column'] = df['value'].where(df['value'] > 0)
df['new_column'] = df.groupby(['column1', 'column2'])['new_column'].transform('sum')
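A runnable sketch of the easy way on toy data (column names from the question, values made up): masking first turns non-positive values into NaN, which transform('sum') then skips.

```python
import pandas as pd

df = pd.DataFrame({
    'column1': ['a', 'a', 'b'],
    'column2': ['x', 'x', 'y'],
    'value':   [1, -2, 3],
})

# Mask out non-positive values, then a plain multi-column groupby + transform
df['new_column'] = df['value'].where(df['value'] > 0)
df['new_column'] = df.groupby(['column1', 'column2'])['new_column'].transform('sum')
print(df['new_column'].tolist())
```

The -2 in group ('a', 'x') is masked to NaN, so both rows of that group get the sum 1.0.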
I have a data frame called v whose columns are ['self','id','desc','name','arch','rel']. When I rename it as follows, it won't let me drop columns, giving a "column not found in axis" error.
case1:
for i in range(0,len(v.columns)):
#I'm trying to add 'v_' prefix to all col names
v.columns.values[i] = 'v_' + v.columns.values[i]
v.drop('v_self',1)
#leads to error
KeyError: "['v_self'] not found in axis"
But if I do it as follows then it works fine
case2:
v.columns = ['v_self','v_id','v_desc','v_name','v_arch','v_rel']
v.drop('v_self',1)
# no error
In both cases if I do following it give same results for its columns
v.columns
#both cases gives
Index(['v_self', 'v_id', 'v_desc', 'v_name', 'v_arch', 'v_rel'],
      dtype='object')
I can't understand why in the case1 it gives an error? Please help, thanks.
That's because .values returns the underlying values. You're not supposed to change those directly. Assigning directly to .columns is supported though.
Try something like this:
import pandas
df = pandas.DataFrame(
[
{key: 0 for key in ["self", "id", "desc", "name", "arch", "rel"]}
for _ in range(100)
]
)
# Add a v_ to every column
df.columns = [f"v_{column}" for column in df.columns]
# Drop one column
df = df.drop(columns=["v_self"])
To your "case 1":
You have hit a bug (#38547) in pandas — "Direct renaming of 1 column seems to be accepted, but only old name is working".
It means that after that "renaming", you may delete the first column
not by using
v.drop('v_self', 1)
but by using the old name
v.drop('self', 1)
Of course, the better option is not to use such buggy renaming in current versions of pandas.
To rename columns by adding a prefix to every label, there is a dedicated dataframe method, .add_prefix():
v = v.add_prefix("v_")
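A runnable sketch combining .add_prefix() with the drop from the question (toy zero-valued data, same column names):

```python
import pandas as pd

v = pd.DataFrame([[0] * 6], columns=['self', 'id', 'desc', 'name', 'arch', 'rel'])

# Prefix every column label in one call, then drop by the new name
v = v.add_prefix('v_')
v = v.drop(columns=['v_self'])
print(v.columns.tolist())
```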
I am trying to sort a dataframe by a particular column: "Lat". However, although when I print out the column names, "Lat" clearly shows up, when I try to use it as the "by" parameter in the sort_values function, I get a KeyError. It doesn't matter which column name I use, I get a key error no matter what.
I have tried using different columns, running in place, and stripping the column names; nothing seems to work.
print(lights_df.columns.tolist())
lights_by_lat = lights_df.sort_values(axis='columns', by="Lat", kind="mergesort")
outputs:
['the_geom', 'OBJECTID', 'TYPE', 'Lat', 'Long']
KeyError: 'Lat'
^output from trying to sort
All you have to do is remove the axis argument:
lights_by_lat = lights_df.sort_values(by = "Lat", kind = "mergesort")
and you should be good.
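A small sketch of why: with axis='columns', by must name a row label (pandas sorts the columns left to right), so the column name 'Lat' is not found. With the default axis=0, by names a column and the rows are sorted. Toy data with the question's column names:

```python
import pandas as pd

# Made-up rows standing in for the real lights data
lights_df = pd.DataFrame({'TYPE': ['a', 'b', 'c'], 'Lat': [3.0, 1.0, 2.0]})

# Default axis=0: sort the rows by the 'Lat' column
lights_by_lat = lights_df.sort_values(by="Lat", kind="mergesort")
print(lights_by_lat['Lat'].tolist())
```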
I have a dataframe with the following columns:
Index([u'PRODUCT', u'RANK', u'PRICE', u'STARS', u'SNAPDATE', u'CAT_NAME'], dtype='object')
For each product of that dataframe I can have NaN values for a specific date.
The goal is to replace for each product the NaN values by the mean of the existing values.
Here is what I tried without success:
for product in df['PRODUCT'].unique():
df = df[df['PRODUCT'] == product]['RANK'].fillna((df[df['PRODUCT'] == product]['RANK'].mean()), inplace=True)
print df
gives me:
TypeError: 'NoneType' object has no attribute '__getitem__'
What am I doing wrong?
You can use groupby to create a mean series:
s = df.groupby('PRODUCT')['RANK'].mean()
Then use this series to fillna values:
df['RANK'] = df['RANK'].fillna(df['PRODUCT'].map(s))
The reason you're getting this error is because of your use of inplace in fillna. Unfortunately, the documentation there is wrong:
Returns: filled : Series
This shows otherwise, though:
>>> df = pd.DataFrame({'a': [3]})
>>> type(df.a.fillna(6, inplace=True))
NoneType
>>> type(df.a.fillna(6))
pandas.core.series.Series
So when you assign
df = df[df['PRODUCT'] == product]['RANK'].fillna((df[df['PRODUCT'] == product]['RANK'].mean()), inplace=True)
you're assigning df = None, and the next iteration fails with the error you get.
You can omit the assignment df =, or, better yet, use the other answer.
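A runnable sketch of the groupby + map approach from the other answer, on made-up values (two products, each with a missing RANK filled by its own mean):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'PRODUCT': ['p1', 'p1', 'p1', 'p2', 'p2'],
    'RANK':    [1.0, 3.0, np.nan, 4.0, np.nan],
})

# Per-product mean, mapped back onto the rows to fill the gaps
s = df.groupby('PRODUCT')['RANK'].mean()
df['RANK'] = df['RANK'].fillna(df['PRODUCT'].map(s))
print(df['RANK'].tolist())
```

The NaN in p1 is filled with p1's mean (2.0) and the NaN in p2 with p2's mean (4.0); no assignment of a fillna(..., inplace=True) result is involved.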