I have a data frame like this:
pd.DataFrame([
[1, None, 'a'],
[1, 3.3, None],
[2, 1.7, 'c']
], columns=['unique_id', 'x', 'target'])
I want to drop one of the rows where unique_id is 1, but take the union of their values. That is, I want to produce this:
pd.DataFrame([
[1, 3.3, 'a'],
[2, 1.7, 'c']
], columns=['unique_id', 'x', 'target'])
Can this be done efficiently in Pandas?
Assume this data frame has between 10k and 100k rows, with maybe 10% being duplicates I want to eliminate. There will only be 2 or 3 duplicates of each unique_id.
Edit: when both rows have disagreeing entries, just taking the first one is fine in my case. But I'm open to solutions where, e.g. both values are collected in a list.
This gives the result for your example. It takes the first non-NaN value for each column within each group.
df.groupby("unique_id", as_index=False).first()
Use groupby and first:
df.groupby('unique_id').first()
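Regarding the edit about collecting both values in a list instead of keeping the first one: a minimal sketch (assuming a reasonably recent pandas) that gathers all non-null values of each column into lists per group:
import pandas as pd

df = pd.DataFrame([
    [1, None, 'a'],
    [1, 3.3, None],
    [2, 1.7, 'c']
], columns=['unique_id', 'x', 'target'])

# Collect every non-null value of each column into a list per unique_id
collected = df.groupby('unique_id', as_index=False).agg(
    lambda s: s.dropna().tolist())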
So I have a dataframe as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[1, 2, 3, 3, 2, 1], [4, 3, 6, 6, 3, 4], [7, 2, 9, 9, 2, 7]]),
                  columns=['a', 'b', 'c', 'a_select', 'b_select', 'c_select'])
df
Now, I may need to reorganize the dataframe (or use two) to accomplish this, but...
I'd like to select the 2 largest values among the '_select' columns in each row, then take the mean of the corresponding original columns.
For example, row 1 would average the values from a & b, and row 2 from a & c (NOT the values from the _select columns we're looking at).
Currently I'm just iterating over each row, which is simple enough but slow on a large dataset, and I can't figure out how to do the equivalent with an apply or lambda function (or whether it's even possible).
A simple one-liner using nlargest:
>>> df.filter(like='select').apply(lambda s: s.nlargest(2), axis=1).mean(axis=1)
For performance, maybe numpy is useful:
>>> np.sort(df.filter(like='select').to_numpy(), 1)[:, -2:].mean(1)
To get the values from the original columns (a, b, c) instead, use argsort:
>>> arr = df.filter(like='select').to_numpy()
>>> df[['a', 'b', 'c']].to_numpy()[[[x] for x in np.arange(len(arr))],
...                                np.argsort(arr, 1)][:, -2:].mean(1)
array([1.5, 5. , 8. ])
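If NumPy >= 1.15 is available, the same argsort idea can be written a bit more directly with np.take_along_axis; a small sketch reusing arr from above (not part of the original answer):
>>> vals = df[['a', 'b', 'c']].to_numpy()
>>> idx = np.argsort(arr, 1)[:, -2:]   # positions of the 2 largest '_select' values per row
>>> np.take_along_axis(vals, idx, axis=1).mean(1)
array([1.5, 5. , 8. ])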
I have a dataframe column where every value is a list (one list per cell, with one or more items).
I want to delete the rows where a specific string is found in that list (the cell can be a 5-item list; if any of its items matches a specific string, the row has to be dropped).
for row in df:
    for count, item in enumerate(df["prescript"]):
        for element in item:
            if "complementary" in element:
                df.drop(row)
df["prescript"] is the column I want to iterate over.
"complementary": if that word is found in a cell's list, the row has to be dropped.
How can I improve the code above to make it work?
Thanks all
Impractical solution that may trigger some new learning:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    columns="index drug prescript".split(),
    data=[
        [0, 1, ['a', 's', 'd', 'f']],
        [1, 2, ['e', 'a', 'e', 'f']],
        [2, 3, ['e', 'a']],
        [3, 4, ['a', 'complementary']],
    ]).set_index("index", drop=True)
df.loc[
    df['prescript'].explode().replace({'complementary': np.nan})
      .groupby(level=0).agg(lambda x: ~pd.isnull(x).any())
]
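A more direct variant of the same explode idea (a sketch, assuming the frame above): build one boolean flag per row telling whether its list contains the word, then invert it.
# True for every row whose list contains the exact string 'complementary'
has_word = (df['prescript'].explode()
                           .eq('complementary')
                           .groupby(level=0)
                           .any())

df[~has_word]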
First mask the rows which contain the word, using Series.apply:
word = "complementary"
word_is_in = df["prescript"].apply(lambda list_item: word in list_item)
Then use boolean indexing to select only the rows which don't contain the word by inverting the boolean Series word_is_in
df = df[~word_is_in]
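Put together on a small made-up frame (the data here is only for illustration), the whole approach looks like this:
import pandas as pd

df = pd.DataFrame({
    "drug": [1, 2, 3],
    "prescript": [['a', 'b'], ['a', 'complementary'], ['c']],
})

word = "complementary"
# True where the row's list contains the word
word_is_in = df["prescript"].apply(lambda list_item: word in list_item)

# Keep only the rows whose list does not contain the word (rows 0 and 2 here)
df = df[~word_is_in]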
I am trying to merge multiple columns into a single column while dropping duplicates and dropping null values but keeping the rows.
What I have:
df = pd.DataFrame(np.array([['nan', 'nan', 'nan'], ['nan', 2, 2], ['nan', 'x', 'nan']]), columns=['a', 'b', 'c'])
What I need:
df = pd.DataFrame(np.array([[''], [2], [1]]), columns=['a'])
I have tried this but I get 1,nan for the last row:
df['a'] = df[['a','b','c']].agg(', '.join, axis=1)
I have also tried the following but I cannot get this to work:
.stack().unstack()
and
.join
but I cannot get these to drop duplicates for each row
This will find the maximum value of a row and replace 'nan' with '':
new_df = pd.DataFrame(df.astype(float).max(axis=1).replace(np.nan, ''), columns=[df.columns[0]])
output:
     a
0
1  2.0
2  1.0
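If the intent is "first non-empty value per row" rather than the row maximum, another possible sketch (assuming the 'nan' entries are literal strings, as in the example frame) is to turn them into real NaN and back-fill across the columns:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([['nan', 'nan', 'nan'],
                            ['nan', 2, 2],
                            ['nan', 'x', 'nan']]), columns=['a', 'b', 'c'])

# Turn the literal 'nan' strings into real NaN, back-fill across the columns,
# keep the first column, and show fully-empty rows as '' like the desired output.
merged = (df.replace('nan', np.nan)
            .bfill(axis=1)
            .iloc[:, [0]]
            .fillna(''))
Note that the values stay strings ('2', 'x') here because np.array casts the mixed example input to strings; on genuinely numeric data the same chain keeps the original dtypes.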
I have two dataframes A and B with common indexes. These common indexes can appear several times (duplicates) in both A and B.
I want to merge A and B handling these 3 cases:
Case 0: if index i appears once in A (i1) and once in B (i1), I want my merged-by-index dataframe to contain the row A(i1), B(i1).
Case 1: if index i appears once in A (i1) and twice in B, in this order: (i1, i2), I want my merged-by-index dataframe to contain the rows A(i1), B(i1) and A(i1), B(i2).
Case 2: if index i appears twice in A, in this order: (i1, i2), and twice in B, in this order: (i1, i2), I want my merged-by-index dataframe to contain the rows A(i1), B(i1) and A(i2), B(i2).
These 3 cases cover all the possibilities that can appear in my data.
When using pandas.merge, cases 0 and 1 work. But for case 2, the returned dataframe contains the rows A(i1), B(i1) and A(i1), B(i2) and A(i2), B(i1) and A(i2), B(i2) instead of A(i1), B(i1) and A(i2), B(i2).
I could use pandas.merge and then delete the undesired merged rows, but is there a way to handle these 3 cases at the same time?
A = pd.DataFrame([[1, 2], [4, 2], [5,5], [5,5], [1,1]], index=['a', 'a', 'b', 'c', 'c'])
B = pd.DataFrame([[1, 5], [4, 8], [7,7], [5,5]], index=['b', 'c', 'a', 'a'])
pd.merge(A,B, left_index=True, right_index=True, how='inner')
For example, for the merge above, I want exactly that result but without the second and third rows for index 'a'.
Basically, your 3 cases can be summarized into 2 cases:
Index i occurs the same number of times (1 or 2) in A and B: merge according to the order.
Index i occurs 2 times in A and 1 time in B: merge using the B content for all rows.
Prep code:
def add_secondary_index(df):
    df.index.name = 'Old'
    df['Order'] = df.groupby(df.index).cumcount()
    df.set_index('Order', append=True, inplace=True)
    return df
import pandas as pd
A = pd.DataFrame([[1, 2], [4, 2], [5,5], [5,5], [1,1]], index=['a', 'a', 'b', 'c', 'c'])
B = pd.DataFrame([[1, 5], [4, 8], [7,7], [5,5]], index=['b', 'c', 'a', 'a'])
index_times = A.groupby(A.index).count() == B.groupby(B.index).count()
Case 1 is easy to solve: you just need to add the secondary index:
same_times_index = index_times[index_times[0].values].index
A_same = A.loc[same_times_index].copy()
B_same = B.loc[same_times_index].copy()
add_secondary_index(A_same)
add_secondary_index(B_same)
result_merge_same = pd.merge(A_same,B_same,left_index=True,right_index=True)
Case 2 needs to be considered separately:
not_same_times_index = index_times[~index_times.index.isin(same_times_index)].index
A_notsame = A.loc[not_same_times_index].copy()
B_notsame = B.loc[not_same_times_index].copy()
result_merge_notsame = pd.merge(A_notsame,B_notsame,left_index=True,right_index=True)
You can decide whether to add a secondary index to result_merge_notsame, or to drop it from result_merge_same.
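To end up with a single frame, the two partial results can then be stacked; a minimal sketch using the variables from the snippets above:
# Drop the helper 'Order' level so both pieces share a plain index again,
# then stack the two partial merges into one result.
result = pd.concat([
    result_merge_same.reset_index(level='Order', drop=True),
    result_merge_notsame,
]).sort_index()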
To illustrate my question clearly, for a dummy dataframe like this:
df = pd.DataFrame({'X' : ['B', 'B', 'A', 'A', 'A'], 'Y' : [1, 2, 3, 4, 5]})
How can I get the top 1 row of group A and the top 2 rows of group B, and get rid of the remaining rows of each group? By the way, the real dataset is big, with hundreds of thousands of rows and thousands of groups.
And the output looks like this:
pd.DataFrame({'X' : ['B', 'B', 'A'], 'Y' : [1, 2, 3]})
My main gripe is that .groupby().head() only gives me a fixed number of rows per group, while I want a different number of rows for different groups.
One way to do this is to create a dictionary containing the number of rows each group should keep; then, in groupby.apply, use g.name as the key to look up the value in the dictionary, and with the head method you can keep a different number of rows for each group:
rows_per_group = {"A": 1, "B": 2}
df.groupby("X", group_keys=False).apply(lambda g: g.head(rows_per_group[g.name]))
# X Y
#2 A 3
#0 B 1
#1 B 2
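Since the real dataset has hundreds of thousands of rows, a fully vectorized alternative to groupby.apply is to compare a per-group counter against the mapped limit (a sketch, not part of the answer above; the row order follows the original frame):
rows_per_group = {"A": 1, "B": 2}

# Number each row within its group (0, 1, 2, ...) and keep it only while
# that number is below the group's allowed row count.
mask = df.groupby("X").cumcount() < df["X"].map(rows_per_group)
out = df[mask]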