Simple DataFrame:
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [0, 1, 2, 3], 'C': ['a', 'b', 'c', 'd']})
df
A B C
0 1 0 a
1 1 1 b
2 2 2 c
3 2 3 d
For every value of column A (a groupby), I wish to get the value of column C for which column B is maximal. For example, for group 1 of column A, the maximum of column B is 1, so I want the value "b" of column C:
A C
0 1 b
1 2 d
There is no need to assume column B is sorted. Performance is the top priority, followed by elegance.
Check with sort_values + drop_duplicates, keeping the last (maximal) row per A:
df.sort_values('B').drop_duplicates('A', keep='last')
A B C
1 1 1 b
3 2 3 d
df.groupby('A').apply(lambda x: x.loc[x['B'].idxmax(), 'C'])
# A
# 1    b
# 2    d
Use idxmax to find the index where B is maximal, then select column C within that group (using a lambda function).
Here's a little fun with groupby and nlargest:
(df.set_index('C')
.groupby('A')['B']
.nlargest(1)
.index
.to_frame()
.reset_index(drop=True))
A C
0 1 b
1 2 d
Or, sort_values, groupby, and last:
df.sort_values('B').groupby('A')['C'].last().reset_index()
A C
0 1 b
1 2 d
Similar solution to @Jondiedoop, but avoids the apply:
u = df.groupby('A')['B'].idxmax()
df.loc[u, ['A', 'C']].reset_index(drop=True)
A C
0 1 b
1 2 d
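Since the question puts performance first, here is a minimal, hedged timing sketch for comparing the approaches above; the frame size and group count are arbitrary assumptions, and results will vary with data shape and pandas version:
import numpy as np
import pandas as pd

# a larger random frame so timings are meaningful (sizes are arbitrary)
big = pd.DataFrame({'A': np.random.randint(0, 1000, 100_000),
                    'B': np.random.rand(100_000),
                    'C': np.random.choice(list('abcd'), 100_000)})

# in IPython/Jupyter:
# %timeit big.sort_values('B').drop_duplicates('A', keep='last')
# %timeit big.loc[big.groupby('A')['B'].idxmax(), ['A', 'C']]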
My objective is to get the duplicated groups of column A and extract each into a new dataframe, ultimately writing each new dataframe to a CSV.
My current dataframe:
column A  column B
A         2
A         2
A         3
B         2
B         3
B         4
C         2
C         2
D         2
D         2
D         3
Desired output:
column A  column B
A         2
A         2
A         3

column A  column B
B         2
B         3

column A  column B
C         2
C         2

column A  column B
D         2
D         2
D         3
You can loop over the unique values of column A and display the data for each value of column A.
Code:
[df[df['ColA']==i] for i in set(df.ColA.values)]
Output:
[ ColA ColB
0 A 2
1 A 2
2 A 3,
ColA ColB
6 C 2
7 C 2,
ColA ColB
3 B 2
4 B 3
5 B 4,
ColA ColB
8 D 2
9 D 2
10 D 3]
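Note that set() does not preserve order, which is why group C appears before B in the output above. If you want the groups in first-appearance order, df['ColA'].unique() preserves it:
# same list comprehension, but unique() keeps first-appearance order
[df[df['ColA'] == v] for v in df['ColA'].unique()]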
Collect the column B values of group 'A' as a reference, then keep only the rows of each group whose column B values appear in it:
g = df.groupby('column A')
dup_chk = df.loc[df['column A'].eq('A'), 'column B']
out = [g.get_group(x)[lambda x: x['column B'].isin(dup_chk)] for x in g.groups]
out  # a list of DataFrames
[ column A column B
0 A 2
1 A 2
2 A 3,
column A column B
3 B 2
4 B 3,
column A column B
6 C 2
7 C 2,
column A column B
8 D 2
9 D 2
10 D 3]
Use the groupby function to group the repeated elements, then use a for loop to iterate through each group:
grouped_df = df.groupby('column A')
for name, group in grouped_df:
    print(group)
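Since the end goal is to write each group to its own CSV, a minimal sketch of that last step (the group_{name}.csv naming scheme is just an assumption):
for name, group in df.groupby('column A'):
    # one file per group; the file name pattern is an assumption
    group.to_csv(f'group_{name}.csv', index=False)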
I am trying to get the max value of one column, according to the cumulative sum of a different column. My dataframe looks like this:
df = pd.DataFrame({'constant': ['a', 'b', 'b', 'c', 'c', 'd', 'a'], 'value': [1, 3, 1, 5, 1, 9, 2]})
indx constant value
0 a 1
1 b 3
2 b 1
3 c 5
4 c 1
5 d 9
6 a 2
I am trying to add a new field with the constant that has the highest cumulative sum of value up to that point in the dataframe. The final dataframe would look like this:
indx constant value new_field
0 a 1 NaN
1 b 3 a
2 b 1 b
3 c 5 b
4 c 1 c
5 d 9 c
6 a 2 d
As you can see, at index 1, a has the highest cumulative sum of value for all prior rows. At index 2, b has the highest cumulative sum of value for all prior rows, and so on.
Anyone have a solution?
As presented, you just need a shift. However, try the following for other scenarios.
Steps
Find the cumulative maximum
Where the cumulative max equals df['value'], copy 'constant'; otherwise make it NaN
The NaNs leave room for a forward-fill to broadcast the constant corresponding to the running max
Outcome
import numpy as np

df = df.assign(new_field=np.where(df['value'] == df['value'].cummax(), df['constant'], np.nan))
df['new_field'] = df['new_field'].ffill().shift()
constant value new_field
0 a 1 NaN
1 b 3 a
2 b 1 b
3 c 5 b
4 c 1 c
5 d 9 c
6 a 2 d
I think you should try approaching this as a pivot table, which lets you use np.argmax over the column axis.
# running per-constant totals of `value` down the index
X = df.pivot_table(
    index=df.index,
    columns=['constant'],
    values='value'
).fillna(0.0).cumsum(axis=0)
# argmax over the column axis gives, per row, the position of the current "winner"
colix = np.argmax(X.values, axis=1)
# fetch the corresponding column names, shifted down one row so each row sees only prior rows
df['winner'] = np.r_[[np.nan], X.columns[colix].values[:-1]]
# and there you go
df
constant value winner
0 a 1 NaN
1 b 3 a
2 b 1 b
3 c 5 b
4 c 1 c
5 d 9 c
6 a 2 d
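A possibly simpler variant of the same pivot idea, sketched here rather than taken from the original answer, replaces np.argmax and the manual shift with idxmax along the columns:
totals = (df.pivot_table(index=df.index, columns='constant', values='value')
            .fillna(0.0)
            .cumsum())
# idxmax over the columns names the current leader; shift(1) restricts it to prior rows
df['new_field'] = totals.idxmax(axis=1).shift()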
You should be a little more careful (values can be negative, which decreases the cumsum); here is what you probably need to do:
df["cumsum"] = df["value"].cumsum()
df["cummax"] = df["cumsum"].cummax()
df["new"] = np.where(df["cumsum"] == df["cummax"], df['constant'], np.nan)
df["new"] = df.ffill()["new"].shift()
df
I have two dataframes:
df1
key value
A 1
B 2
C 2
D 3
df2
key value
C 3
D 3
E 5
F 7
I would like to merge these dataframes by their key and get a dataframe which looks like the one below. So, I want to end up with a single value column (no new columns with suffixes), dropping df2's value whenever the two values do not agree.
df_merged
key value
A 1
B 2
C 2
D 3
E 5
F 7
How can I do this? Should I rather use join or concat? Thanks a lot!
Use concat with DataFrame.drop_duplicates by column key:
df = pd.concat([df1, df2], ignore_index=True).drop_duplicates('key')
print (df)
key value
0 A 1
1 B 2
2 C 2
3 D 3
6 E 5
7 F 7
Just adding to @jezrael's answer, you could also use groupby with first:
>>> pd.concat([df1, df2], ignore_index=True).groupby('key', as_index=False).first()
key value
0 A 1
1 B 2
2 C 2
3 D 3
4 E 5
5 F 7
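Another option is a small combine_first sketch: align both frames on key and let df1 win wherever both have a value (note the value column may come back as float once the alignment introduces NaNs):
# df1's values take priority; df2 only fills keys missing from df1
df1.set_index('key').combine_first(df2.set_index('key')).reset_index()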
So I have a pandas df (Python 3.6) like this:
index A B C ...
A 1 5 0
B 0 0 1
C 1 2 4
...
As you can see, the index values are the same as the column names.
What I'm trying to do is to get a new column in the dataframe that holds the names of the columns where the value is greater than 0:
index A B C ... NewColumn
A 1 5 0 [A,B]
B 0 0 1 [C]
C 1 2 4 [A,B,C]
...
I've been trying with iterrows with no success.
I also know I can melt and pivot, but I think there should be a way with apply and a lambda, maybe?
Thanks in advance
If the new column should be a string, compare with DataFrame.gt and take the dot product with the column names (which concatenates a name wherever the boolean is True), then strip the trailing separator:
df['NewColumn'] = df.gt(0).dot(df.columns + ', ').str.rstrip(', ')
print (df)
A B C NewColumn
A 1 5 0 A, B
B 0 0 1 C
C 1 2 4 A, B, C
And for lists, use apply with a lambda function:
df['NewColumn'] = df.gt(0).apply(lambda x: x.index[x].tolist(), axis=1)
print (df)
A B C NewColumn
A 1 5 0 [A, B]
B 0 0 1 [C]
C 1 2 4 [A, B, C]
Use:
df['NewColumn'] = df.apply(lambda x: list(x[x.gt(0)].index),axis=1)
A B C NewColumn
A 1 5 0 [A, B]
B 0 0 1 [C]
C 1 2 4 [A, B, C]
You could use .gt to check which values are greater than 0 and .dot to obtain the corresponding columns. Finally, .apply(list) turns the results into lists:
df.loc[:, 'NewColumn'] = df.gt(0).dot(df.columns).apply(list)
A B C NewColumn
index
A 1 5 0 [A, B]
B 0 0 1 [C]
C 1 2 4 [A, B, C]
Note: this works with single-letter columns; otherwise you could do:
df.loc[:, 'NewColumn'] = ((df.gt(0) @ df.columns.map('{},'.format))
                          .str.rstrip(',').str.split(','))
A B C NewColumn
index
A 1 5 0 [A, B]
B 0 0 1 [C]
C 1 2 4 [A, B, C]
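And since the question mentions melt, here is a rough sketch of that route as well (assumes pandas >= 1.1 for ignore_index=False, and starts from the original frame before any NewColumn has been added):
# long format: one row per (index, column) pair, keeping the original index
m = df.melt(ignore_index=False, var_name='col')
# keep positive cells, then collect the column names per original row
df['NewColumn'] = (m[m['value'] > 0]
                   .groupby(level=0)['col']
                   .agg(list))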
I am trying to get an output where column d is summed across d1 and d2 wherever a, b, c are the same (like a groupby).
For example
d1 = pd.DataFrame([[1,2,3,4]],columns=['a','b','c','d'])
d2 = pd.DataFrame([[1,2,3,4],[2,3,4,5]],columns=['a','b','c','d'])
then I'd like to get an output as
a b c d
0 1 2 3 8
1 2 3 4 5
That is, merging the two dataframes and summing the resulting column d where a, b, c are the same.
d1.add(d2) (or radd) just adds all columns element-wise by index, which is not what I want.
The solution should itself be a DataFrame that can be added to another one in the same way.
Any help is appreciated.
You can use set_index first:
print (d2.set_index(['a','b','c'])
.add(d1.set_index(['a','b','c']), fill_value=0)
.astype(int)
.reset_index())
a b c d
0 1 2 3 8
1 2 3 4 5
df = pd.concat([d1, d2])
df.groupby(['a', 'b', 'c'], as_index=False)['d'].sum()
   a  b  c  d
0  1  2  3  8
1  2  3  4  5
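One nice property of the concat-and-sum pattern is that, as the question asks, the result can be added to yet another frame the same way; for example, with a hypothetical third frame d3:
d3 = pd.DataFrame([[1, 2, 3, 10]], columns=['a', 'b', 'c', 'd'])  # hypothetical extra frame
out = (pd.concat([d1, d2, d3])
         .groupby(['a', 'b', 'c'], as_index=False)['d']
         .sum())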