Removing duplicates from a Pandas DataFrame with a condition for retaining the original - Python

Assuming I have the following DataFrame:
A | B
1 | Ms
1 | PhD
2 | Ms
2 | Bs
I want to remove the duplicate rows with respect to column A, retaining the row with 'PhD' in column B as the original. If there is no 'PhD', I want to retain the row with 'Bs' in column B instead.
I am trying to use
df.drop_duplicates('A')
with a condition
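For reference, one way to express that kind of condition is to map column B to an explicit priority and sort before dropping duplicates. This is only a minimal sketch (the priority dict and variable names are illustrative, and sort_values(key=...) assumes pandas >= 1.1); the answers below show other approaches:
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': ['Ms', 'PhD', 'Ms', 'Bs']})

# Lower number = more preferred; anything unlisted maps to NaN and sorts last.
priority = {'PhD': 0, 'Bs': 1, 'Ms': 2}

result = (df.sort_values('B', key=lambda s: s.map(priority))
            .drop_duplicates('A')
            .sort_values('A'))
print(result)
#    A    B
# 1  1  PhD
# 3  2   Bs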

Consider using Categoricals. They're a nice way to group/order text non-alphabetically (among other things).
import pandas as pd

# Create a test DataFrame with two columns: A (integer) and B (string).
df = pd.DataFrame([(1, 'Ms'), (1, 'PhD'),
                   (2, 'Ms'), (2, 'Bs'),
                   (3, 'PhD'), (3, 'Bs'),
                   (4, 'Ms'), (4, 'PhD'), (4, 'Bs')],
                  columns=['A', 'B'])
print("Original data")
print(df)

# Force the string column B to dtype 'category'.
df['B'] = df['B'].astype('category')
# Define the valid categories, in order of precedence:
df['B'] = df['B'].cat.set_categories(['PhD', 'Bs', 'Ms'], ordered=True)

# sort_values now respects the category order.
df.sort_values(['A', 'B'], inplace=True, ascending=True)
print("Now sorted by custom categories (PhD > Bs > Ms)")
print(df)

# drop_duplicates keeps the first row per A.
df_unique = df.drop_duplicates('A')
print("Keep the highest value category given duplicate integer group")
print(df_unique)
Prints:
Original data
A B
0 1 Ms
1 1 PhD
2 2 Ms
3 2 Bs
4 3 PhD
5 3 Bs
6 4 Ms
7 4 PhD
8 4 Bs
Now sorted by custom categories (PhD > Bs > Ms)
A B
1 1 PhD
0 1 Ms
3 2 Bs
2 2 Ms
4 3 PhD
5 3 Bs
7 4 PhD
8 4 Bs
6 4 Ms
Keep the highest value category given duplicate integer group
A B
1 1 PhD
3 2 Bs
4 3 PhD
7 4 PhD

Given this df:
>>> df
A B
0 1 Ms
1 1 Ms
2 1 Ms
3 1 Ms
4 1 PhD
5 2 Ms
6 2 Ms
7 2 Bs
8 2 PhD
Sorting a dataframe with a custom function:
def sort_df(df, column, key):
    '''Takes a DataFrame, a column label and a custom key function for sorting;
    returns the DataFrame sorted by that column using that function.'''
    col = df[column]
    # .ix was removed in modern pandas; select the sorted row positions with .iloc instead.
    df = df.iloc[[i[1] for i in sorted(zip(col, range(len(col))), key=key)]]
    return df
Our key function for sorting (it receives a (value, position) tuple from the zip above):
cmp = lambda x: 2 if 'PhD' in x else 1 if 'Bs' in x else 0
In action:
sort_df(df,'B',cmp).drop_duplicates('A', take_last=True)
P.S. In modern pandas versions there is no take_last option; use keep='last' instead - see the docs.
A B
4 1 PhD
8 2 PhD
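For completeness, a rough modern-pandas equivalent of the same idea, without .ix or take_last (a sketch; the rank dict is illustrative and sort_values(key=...) assumes pandas >= 1.1):
rank = {'Ms': 0, 'Bs': 1, 'PhD': 2}
# Sort ascending by rank, then keep the last (highest-ranked) row per A.
df.sort_values('B', key=lambda s: s.map(rank)).drop_duplicates('A', keep='last')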

Assuming B values are unique for a given A value, and that every A value has a row with 'Bs' in column B:
df2 = df[df['B']=="PhD"]
will give you a dataframe with the PhD rows you want.
Then remove all the PhD and Ms from df:
df = df[df['B']=="Bs"]
Then concatenate df and df2:
df3 = pd.concat([df2, df])
Then you can use drop_duplicates like you wanted:
df3.drop_duplicates('A', inplace=True)

Remove duplicates, retain original:
Sort your rows so the one you want to keep is on top within each group; then drop_duplicates does the right thing.
import pandas as pd

df = pd.DataFrame([(1, '2022-01-25'),
                   (1, '2022-05-25'),
                   (2, '2021-12-20'),
                   (2, '2021-11-20'),
                   (3, '2020-03-03'),
                   (3, '2020-03-04'),
                   (4, '2019-07-06'),
                   (4, '2019-07-07'),
                   (4, '2019-07-05')], columns=['A', 'B'])
print("Original data")
print(df.to_string(index=False))

# Sort your dataframe so that the row you want is on top within each A group:
df.sort_values(['A', 'B'], inplace=True, ascending=True)
print("custom sort")
print(df.to_string(index=False))

# Dropping duplicates this way keeps the first row per A.
df_unique = df.drop_duplicates('A')
print("Keep first")
print(df_unique.to_string(index=False))
Prints:
Original data
A B
1 2022-01-25
1 2022-05-25
2 2021-12-20
2 2021-11-20
3 2020-03-03
3 2020-03-04
4 2019-07-06
4 2019-07-07
4 2019-07-05
custom sort
A B
1 2022-01-25
1 2022-05-25
2 2021-11-20
2 2021-12-20
3 2020-03-03
3 2020-03-04
4 2019-07-05
4 2019-07-06
4 2019-07-07
Keep first
A B
1 2022-01-25
2 2021-11-20
3 2020-03-03
4 2019-07-05

Related

Sort Dataframe by Descending Rows AND Columns at the Same Time

I currently have a dataframe of countries by series, with values ranging from 0-25.
I want to sort the df so that the highest values appear in the top left (first), while the lowest appear in the bottom right (last).
FROM
A B C D ...
USA 4 0 10 16
CHN 2 3 13 22
UK 2 1 8 14
...
TO
D C A B ...
CHN 22 13 2 3
USA 16 10 4 0
UK 14 8 2 1
...
In this, the column with the highest values is now first, and the same is true with the index.
I have considered reindexing, but this loses the 'Countries' Index.
D C A B ...
0 22 13 2 3
1 16 10 4 0
2 14 8 2 1
...
I have thought about creating a new column and row holding the Mean or Sum of values for that respective column/row, but is this the most efficient way?
How would I then sort the DF once I have the new rows/columns?
Is there a way to reindex using...
df_mv.reindex(df_mv.mean(or sum)().sort_values(ascending = False).index, axis=1)
... that would allow me to keep the country index, and simply sort it accordingly?
Thanks for any and all advice or assistance.
EDIT
Intended result organizes columns AND rows from largest to smallest.
Regarding the first row of the A and B columns in the intended output, these are supposed to be 2 and 3 respectively, because the intended result ranks column A above column B by both sum and mean (either sum or mean can serve as the 'value' of a row/column).
By saying the higher numbers would be in the top left and the lower ones in the bottom right, I only meant this as a general trend for the resulting df; the intended focus is on ordering whole columns and rows. I apologize for the confusion.
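For what it's worth, the reindex idea floated in the question can be written directly while keeping the country index. A sketch, assuming the sum is used as the ranking metric (mean works the same way; new_df is just an illustrative name):
new_df = df.reindex(
    index=df.sum(axis=1).sort_values(ascending=False).index,    # rows, largest total first
    columns=df.sum(axis=0).sort_values(ascending=False).index,  # columns, largest total first
)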
You could use:
rows_index=df.max(axis=1).sort_values(ascending=False).index
col_index=df.max().sort_values(ascending=False).index
new_df=df.loc[rows_index,col_index]
print(new_df)
D C A B
CHN 22 13 2 3
USA 16 10 4 0
UK 14 8 2 1
Use .T to transpose rows to columns and vice versa:
df = df.sort_values(df.max().idxmax(), ascending=False)
df = df.T
df = df.sort_values(df.columns[0], ascending=False).T
Result:
>>> df
D C B A
CHN 22 13 3 2
USA 16 10 0 4
UK 14 8 1 2
Here's another way, this time without transposing but using axis=1 as an argument:
df = df.sort_values(df.max().idxmax(), ascending=False)
df = df.sort_values(df.index[0], axis=1, ascending=False)
Using numpy:
import numpy as np

arr = df.to_numpy()
# Reorder the rows so the row containing the largest maximum comes first.
arr = arr[np.max(arr, axis=1).argsort()[::-1], :]
# Sort each row's values in descending order.
arr = np.sort(arr, axis=1)[:, ::-1]
# Note: the original index and column labels are reused as-is; only the values are rearranged.
df1 = pd.DataFrame(arr, index=df.index, columns=df.columns)
print(df1)
Output:
A B C D
USA 22 13 3 2
CHN 16 10 4 0
UK 14 8 2 1

Python pandas: Append rows of DataFrame and delete the appended rows

import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
    'text': ['abc', 'zxc', 'qwe', 'asf', 'efe', 'ert', 'poi', 'wer', 'eer', 'poy', 'wqr']})
I have a DataFrame with columns:
id text
1 abc
2 zxc
3 qwe
4 asf
5 efe
6 ert
7 poi
8 wer
9 eer
10 poy
11 wqr
I have a list L = [1,3,6,10] containing id values.
Using this list, I want to join up the text column in groups: taking the first two values in the list, 1 and 3, append the text of id 2 onto id 1 and delete the row with id 2; then taking 3 and 6, append the text of ids 4 and 5 onto id 3 and delete those rows; and so on for each consecutive pair of elements in the list.
My final output would look like this:
id text
1 abczxc # joining id 1 and 2
3 qweasfefe # joining id 3,4 and 5
6 ertpoiwereer # joining id 6,7,8,9
10 poywqr # joining id 10 and 11
You can use where with isin and ffill to build a grouping Series, which is then used for groupby with ''.join applied to each group's text:
s = df.id.where(df.id.isin(L)).ffill().astype(int)
df1 = df.groupby(s)['text'].apply(''.join).reset_index()
print (df1)
id text
0 1 abczxc
1 3 qweasfefe
2 6 ertpoiwereer
3 10 poywqr
It works because:
s = df.id.where(df.id.isin(L)).ffill().astype(int)
print (s)
0 1
1 1
2 3
3 3
4 3
5 6
6 6
7 6
8 6
9 10
10 10
Name: id, dtype: int32
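For reference, a cumsum-based variant gives the same grouping. A sketch (assuming L is sorted, its first element is the first id in the frame, and pandas >= 0.25 for named aggregation; grp and out are illustrative names):
# Each id that appears in L starts a new group: 1,1,2,2,2,3,3,3,3,4,4
grp = df.id.isin(L).cumsum()
out = (df.groupby(grp)
         .agg(id=('id', 'first'), text=('text', ''.join))
         .reset_index(drop=True))
print(out)
#    id          text
# 0   1        abczxc
# 1   3     qweasfefe
# 2   6  ertpoiwereer
# 3  10        poywqr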
I changed the values not in the list to np.nan and then used ffill and groupby. Though @Jezrael's approach is much better. I need to remember to use cumsum :)
import numpy as np

l = [1, 3, 6, 10]
df.loc[~df.id.isin(l), 'id'] = np.nan   # use .loc to avoid chained assignment
df = df.ffill().groupby('id').sum()
text
id
1.0 abczxc
3.0 qweasfefe
6.0 ertpoiwereer
10.0 poywqr
Use pd.cut to create your bins, then groupby with a lambda function to join the text in each group.
df.groupby(pd.cut(df.id,L+[np.inf],right=False, labels=[i for i in L])).apply(lambda x: ''.join(x.text))
EDIT:
(df.groupby(pd.cut(df.id, L + [np.inf],
                   right=False,
                   labels=[i for i in L]))
   .apply(lambda x: ''.join(x.text))
   .reset_index()
   .rename(columns={0: 'text'}))
Output:
id text
0 1 abczxc
1 3 qweasfefe
2 6 ertpoiwereer
3 10 poywqr

In pandas Dataframe with multiindex how can I filter by order?

Assume the following dataframe
>>> import pandas as pd
>>> L = [(1,'A',9,9), (1,'C',8,8), (1,'D',4,5),(2,'H',7,7),(2,'L',5,5)]
>>> df = pd.DataFrame.from_records(L).set_index([0,1])
>>> df
2 3
0 1
1 A 9 9
C 8 8
D 4 5
2 H 7 7
L 5 5
I want to filter the rows at the nth position within each level-0 group of the multiindex, e.g. keeping only the first row of each group:
2 3
0 1
1 A 9 9
2 H 7 7
or only the third row of each group:
2 3
0 1
1 D 4 5
How can I achieve this ?
You can filter the rows with GroupBy.nth after grouping on the first level of the MultiIndexed DF. Since n is 0-based, pass the position accordingly, as shown:
1) To select the first row grouped per level=0:
df.groupby(level=0, as_index=False).nth(0)
2) To select the third row grouped per level=0:
df.groupby(level=0, as_index=False).nth(2)
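An alternative sketch using GroupBy.cumcount, which gives the 0-based position of each row within its level-0 group and keeps the MultiIndex untouched:
df[df.groupby(level=0).cumcount() == 0]   # first row of each group
df[df.groupby(level=0).cumcount() == 2]   # third row of each group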

Pandas merge two dataframes with different columns

I'm surely missing something simple here. Trying to merge two dataframes in pandas that have mostly the same column names, but the right dataframe has some columns that the left doesn't have, and vice versa.
>df_may
id quantity attr_1 attr_2
0 1 20 0 1
1 2 23 1 1
2 3 19 1 1
3 4 19 0 0
>df_jun
id quantity attr_1 attr_3
0 5 8 1 0
1 6 13 0 1
2 7 20 1 1
3 8 25 1 1
I've tried joining with an outer join:
mayjundf = pd.DataFrame.merge(df_may, df_jun, how="outer")
But that yields:
Left data columns not unique: Index([....
I've also specified a single column to join on (on = "id", e.g.), but that duplicates all columns except id like attr_1_x, attr_1_y, which is not ideal. I've also passed the entire list of columns (there are many) to on:
mayjundf = pd.DataFrame.merge(df_may, df_jun, how="outer", on=list(df_may.columns.values))
Which yields:
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
What am I missing? I'd like to get a df with all rows appended, and attr_1, attr_2, attr_3 populated where possible, NaN where they don't show up. This seems like a pretty typical workflow for data munging, but I'm stuck.
I think in this case concat is what you want:
In [12]:
pd.concat([df_may, df_jun], axis=0, ignore_index=True)
Out[12]:
attr_1 attr_2 attr_3 id quantity
0 0 1 NaN 1 20
1 1 1 NaN 2 23
2 1 1 NaN 3 19
3 0 0 NaN 4 19
4 1 NaN 0 5 8
5 0 NaN 1 6 13
6 1 NaN 1 7 20
7 1 NaN 1 8 25
By passing axis=0 here you are stacking the dfs on top of each other, which I believe is what you want; NaN values are then produced where a column is absent from one of the respective dfs.
The accepted answer will break if there are duplicate headers:
InvalidIndexError: Reindexing only valid with uniquely valued Index objects.
For example, here A has 3x trial columns, which prevents concat:
A = pd.DataFrame([[3, 1, 4, 1]], columns=['id', 'trial', 'trial', 'trial'])
# id trial trial trial
# 0 3 1 4 1
B = pd.DataFrame([[5, 9], [2, 6]], columns=['id', 'trial'])
# id trial
# 0 5 9
# 1 2 6
pd.concat([A, B], ignore_index=True)
# InvalidIndexError: Reindexing only valid with uniquely valued Index objects
To fix this, deduplicate the column names before concat:
parser = pd.io.parsers.base_parser.ParserBase({'usecols': None})

for df in [A, B]:
    df.columns = parser._maybe_dedup_names(df.columns)

pd.concat([A, B], ignore_index=True)
# id trial trial.1 trial.2
# 0 3 1 4 1
# 1 5 9 NaN NaN
# 2 2 6 NaN NaN
Or as a one-liner but less readable:
pd.concat([df.set_axis(parser._maybe_dedup_names(df.columns), axis=1) for df in [A, B]], ignore_index=True)
Note that for pandas <1.3.0, use: parser = pd.io.parsers.ParserBase({})
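Since _maybe_dedup_names is a private API, an alternative is a small standalone helper that renames duplicate labels the same way (trial, trial.1, trial.2, ...) before concatenating. A sketch (dedup_columns is an illustrative name, not a pandas function):
from collections import Counter

def dedup_columns(cols):
    # Append .1, .2, ... to repeated column labels.
    seen = Counter()
    out = []
    for c in cols:
        out.append(c if seen[c] == 0 else f"{c}.{seen[c]}")
        seen[c] += 1
    return out

for df in [A, B]:
    df.columns = dedup_columns(df.columns)
pd.concat([A, B], ignore_index=True)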
I had this problem today using concat, append or merge, and I got around it by adding a sequentially numbered helper column and then doing an outer join:
helper = 1
for i in df1.index:
    df1.loc[i, 'helper'] = helper
    helper = helper + 1
for i in df2.index:
    df2.loc[i, 'helper'] = helper
    helper = helper + 1
df1.merge(df2, on='helper', how='outer')
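The same trick can be written without the Python loops; a sketch (the numbering just has to be unique across both frames so no helper values collide):
df1 = df1.assign(helper=range(len(df1)))
df2 = df2.assign(helper=range(len(df1), len(df1) + len(df2)))
df1.merge(df2, on='helper', how='outer')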

pandas groupby, then sort within groups

I want to group my dataframe by two columns and then sort the aggregated results within those groups.
In [167]: df
Out[167]:
count job source
0 2 sales A
1 4 sales B
2 6 sales C
3 3 sales D
4 7 sales E
5 5 market A
6 3 market B
7 2 market C
8 4 market D
9 1 market E
In [168]: df.groupby(['job','source']).agg({'count':sum})
Out[168]:
count
job source
market A 5
B 3
C 2
D 4
E 1
sales A 2
B 4
C 6
D 3
E 7
I would now like to sort the 'count' column in descending order within each of the groups, and then take only the top three rows. To get something like:
count
job source
market A 5
D 4
B 3
sales E 7
C 6
B 4
You could also just do it in one go, by doing the sort first and using head to take the first 3 of each group.
In[34]: df.sort_values(['job','count'],ascending=False).groupby('job').head(3)
Out[35]:
count job source
4 7 sales E
2 6 sales C
1 4 sales B
5 5 market A
8 4 market D
6 3 market B
What you want to do is actually again a groupby (on the result of the first groupby): sort and take the first three elements per group.
Starting from the result of the first groupby:
In [60]: df_agg = df.groupby(['job','source']).agg({'count':sum})
We group by the first level of the index:
In [63]: g = df_agg['count'].groupby('job', group_keys=False)
Then we want to sort ('order') each group and take the first three elements:
In [64]: res = g.apply(lambda x: x.sort_values(ascending=False).head(3))
However, there is a shortcut function for this, nlargest:
In [65]: g.nlargest(3)
Out[65]:
job source
market A 5
D 4
B 3
sales E 7
C 6
B 4
dtype: int64
So in one go, this looks like:
df_agg['count'].groupby('job', group_keys=False).nlargest(3)
Here's another example of taking the top 3 in sorted order, and of sorting within the groups:
In [43]: import pandas as pd
In [44]: df = pd.DataFrame({"name":["Foo", "Foo", "Baar", "Foo", "Baar", "Foo", "Baar", "Baar"], "count_1":[5,10,12,15,20,25,30,35], "count_2" :[100,150,100,25,250,300,400,500]})
In [45]: df
Out[45]:
count_1 count_2 name
0 5 100 Foo
1 10 150 Foo
2 12 100 Baar
3 15 25 Foo
4 20 250 Baar
5 25 300 Foo
6 30 400 Baar
7 35 500 Baar
### Top 3 on sorted order:
In [46]: df.groupby(["name"])["count_1"].nlargest(3)
Out[46]:
name
Baar 7 35
6 30
4 20
Foo 5 25
3 15
1 10
dtype: int64
### Sorting within groups based on column "count_1":
In [48]: df.groupby(["name"]).apply(lambda x: x.sort_values(["count_1"], ascending = False)).reset_index(drop=True)
Out[48]:
count_1 count_2 name
0 35 500 Baar
1 30 400 Baar
2 20 250 Baar
3 12 100 Baar
4 25 300 Foo
5 15 25 Foo
6 10 150 Foo
7 5 100 Foo
Try this instead, which is a simple way to do a groupby and sort in descending order:
df.groupby(['companyName'])['overallRating'].sum().sort_values(ascending=False).head(20)
If you don't need to sum a column, then use @tvashtar's answer. If you do need to sum, then you can use @joris' answer or this one, which is very similar to it.
df.groupby(['job']).apply(lambda x: (x.groupby('source')
                                      .sum()
                                      .sort_values('count', ascending=False))
                                     .head(3))
When the grouped dataframe contains more than one grouping column (a "multi-index"), other methods erase the remaining columns:
import numpy as np
import pandas as pd

edf = pd.DataFrame({"job": ["sales", "sales", "sales", "sales", "sales",
                            "market", "market", "market", "market", "market"],
                    "source": ["A", "B", "C", "D", "E", "A", "B", "C", "D", "E"],
                    "count": [2, 4, 6, 3, 7, 5, 3, 2, 4, 1],
                    "other_col": [1, 2, 3, 4, 56, 6, 3, 4, 6, 11]})
gdf = edf.groupby(["job", "source"]).agg({"count": sum, "other_col": np.mean})
gdf.groupby(level=0, group_keys=False).apply(lambda g: g.sort_values("count", ascending=False))
This keeps other_col and orders by the count column within each group.
I was getting this error without using "by":
TypeError: sort_values() missing 1 required positional argument: 'by'
So, I changed it to this and now it's working:
df.groupby(['job','source']).agg({'count':sum}).sort_values(by='count',ascending=False).head(20)
You can do it in one line -
df.groupby(['job']).apply(lambda x: x.sort_values(['count'], ascending=False)
                                     .head(3)
                                     .drop('job', axis=1))
What apply() does is take each group produced by groupby and pass it as x to the lambda function.
@joris' answer helped a lot.
This is what worked for me.
df.groupby(['job'])['count'].nlargest(3)
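On the question's df this yields something like the following (a sketch of the expected output, keeping the original row positions as the second index level):
job
market  5    5
        8    4
        6    3
sales   4    7
        2    6
        1    4
Name: count, dtype: int64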
