Pandas Groupby remove row where specific value combination occurs - python

As stated, I would like to remove a specific row based on group-by logic. In the dataframe below, wherever the combination of F and G occurs for an ID, I would like to remove the row with value G.
import pandas as pd
op_d = {'ID': [1,1,2,2,3,4],'Value':['F','G','K','G','H','G']}
df = pd.DataFrame(data=op_d)
df
In this case, I would like to remove the second row, which has value 'G' for ID = 1. So far:
temp = df.groupby('ID').apply(lambda x: (x['Value'].nunique()>1)).reset_index().rename(columns={0:'Expected_Output'})
temp = temp.loc[temp['Expected_Output']==True]
multiple_options = df.loc[df['ID'].isin(temp['ID'])]
So far, I am able to figure out which IDs have multiple values. Could you tell me how to remove this specific row?

Use Series.eq with GroupBy.transform and any:
m1, m2 = df['Value'].eq('F'), df['Value'].eq('G')
m = m2 & m1.groupby(df['ID']).transform('any') & m2.groupby(df['ID']).transform('any')
df1 = df[~m]
Result:
print(df1)
  ID Value
0  1     F
2  2     K
3  2     G
4  3     H
5  4     G

Using isin:
c = (df['Value'].isin(['F','G']).groupby(df['ID']).transform('sum').eq(2)
& df['Value'].eq('G'))
out = df[~c].copy()
print(out)
  ID Value
0  1     F
2  2     K
3  2     G
4  3     H
5  4     G
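For reference, a hedged one-liner sketch that expresses the same drop condition with a single transform per ID group; it assumes the df defined in the question and matches the results above:
# Drop 'G' rows only in groups that also contain an 'F'
# (sketch assuming the question's df)
drop_g = df.groupby('ID')['Value'].transform(
    lambda s: s.eq('G') & s.eq('F').any())
print(df[~drop_g])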

Pandas: Get top n columns based on a row values

Given a dataframe with a single row, I need to filter it down to a smaller one containing only the columns with the largest values in that row.
What's the most efficient way?
df = pd.DataFrame({'a':[1], 'b':[10], 'c':[3], 'd':[5]})
   a   b  c  d
0  1  10  3  5
For example, the top-3 features:
    b  c  d
0  10  3  5
Use sorting per row and select the first 3 columns:
df1 = df.sort_values(0, axis=1, ascending=False).iloc[:, :3]
print (df1)
b d c
0 10 5 3
Solution with Series.nlargest:
df1 = df.iloc[0].nlargest(3).to_frame().T
print (df1)
b d c
0 10 5 3
You can transpose with T and use nlargest():
new = df.T.nlargest(columns = 0, n = 3).T
print(new)
b d c
0 10 5 3
You can use np.argsort to get the solution. In the code below, this NumPy method gives the indices of the column values in descending order; slicing then selects the indices of the n largest values.
import pandas as pd
import numpy as np
# Your dataframe
df = pd.DataFrame({'a':[1], 'b':[10], 'c':[3], 'd':[5]})
# Pick the number n to find n largest values
nlargest = 3
# Get the order of the largest value columns by their indices
order = np.argsort(-df.values, axis=1)[:, :nlargest]
# Find the columns with the largest values
top_features = df.columns[order].tolist()[0]
# Filter the dataframe by the columns
top_features_df = df[top_features]
top_features_df
output:
b d c
0 10 5 3
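If only the column names are needed rather than a filtered frame, a minimal follow-up sketch using the Series.nlargest approach shown above:
top_cols = df.iloc[0].nlargest(3).index.tolist()
print(top_cols)  # ['b', 'd', 'c']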

How to get a transition string per row object based on two different columns in python (without using loops)?

I have the following data structure:
The columns s and d indicate the transitions of the object in column x. What I want to do is get a transition string per object present in column x, e.g. as a new column as follows:
Is there a lean way to do it using pandas, without using too many loops?
This was the code I tried:
obj = df['x'].tolist()
rows = []
for o in obj:
    locs = df[df['x'] == o]['s'].tolist()
    str_locs = '->'.join(str(l) for l in locs)
    print(str_locs)
    d = dict()
    d['x'] = o
    d['new'] = str_locs
    rows.append(d)
tmp = pd.DataFrame(rows)
This gives the output tmp as:
x new
a 1->2->4->8
a 1->2->4->8
a 1->2->4->8
a 1->2->4->8
b 5->11
b 5->11
Example df:
df = pd.DataFrame({"x":["a","a","a","a","b","b"], "s":[1,2,4,8,5,11],"d":[2,4,8,9,11,12]})
print(df)
x s d
0 a 1 2
1 a 2 4
2 a 4 8
3 a 8 9
4 b 5 11
5 b 11 12
The following code generates a transition string for every object present in column x:
1. groupby on column x and collect a list of [s, d] pairs for every object in x
2. Merge the lists of pairs sequentially
3. Remove consecutive duplicates from the merged list using itertools.groupby
4. Join the items of the merged list with -> to make a single string
5. Finally, map the resulting series back onto column x of the input df
from itertools import groupby
grp = df.groupby('x')[['s', 'd']].apply(lambda x: x.values.tolist())
grp = grp.apply(lambda x: [str(item) for tup in x for item in tup])
sr = grp.apply(lambda x: "->".join([i[0] for i in groupby(x)]))
df["new"] = df["x"].map(sr)
print(df)
x s d new
0 a 1 2 1->2->4->8->9
1 a 2 4 1->2->4->8->9
2 a 4 8 1->2->4->8->9
3 a 8 9 1->2->4->8->9
4 b 5 11 5->11->12
5 b 11 12 5->11->12
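Since in this example each row's d equals the next row's s within a group, a shorter hedged sketch (relying on that contiguity assumption, which may not hold for other data) joins the s values plus the last d per group:
# Assumes transitions are contiguous within each group (s[i+1] == d[i])
sr = df.groupby('x').apply(
    lambda g: '->'.join(map(str, g['s'].tolist() + [g['d'].iloc[-1]])))
df['new'] = df['x'].map(sr)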

How to drop rows based on another column only if it has multiple different values

What I have?
I have a dataframe like this:
id value
0 0 5
1 0 5
2 0 6
3 1 7
4 1 7
What I want to get?
I want to drop all the rows whose id has more than one distinct value. In the example above I want to drop all the rows with id = 0:
id value
3 1 7
4 1 7
What I have tried?
import pandas as pd
df = pd.DataFrame({'id':[0, 0, 0, 1, 1], 'value':[5,5,6,7,7]})
print(df)
id_list = df['id'].tolist()
id_set = set(id_list)
for id in id_set:
    temp_list = df.loc[df['id'] == id, 'value'].tolist()
    s = set(temp_list)
    if len(s) > 1:
        df = df.loc[df['id'] != id]
It works, but it's ugly and inefficient.
Is there a better, more Pythonic way using pandas methods?
Use GroupBy.transform with DataFrameGroupBy.nunique to get the number of unique values per group as a Series aligned with the original rows, then compare and filter with boolean indexing:
df = df[df.groupby('id')['value'].transform('nunique').eq(1)]
print (df)
id value
3 1 7
4 1 7
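A hedged equivalent of the transform solution using GroupBy.filter, which keeps only the groups whose value column has a single unique value (often slower than transform on many groups, but arguably more readable):
# Keep groups where 'value' has exactly one unique element
df = df.groupby('id').filter(lambda g: g['value'].nunique() == 1)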
# Try this code #
import pandas as pd
id1 = pd.Series([0,0,0,1,1])
value = pd.Series([5,5,6,7,7])
data = pd.DataFrame({'id':id1,'value':value})
datag = data.groupby('id')
# delete rows whose id has more than one distinct value
datadel = []
for i in set(data.id):
    if len(set(datag.get_group(i)['value'])) != 1:
        datadel.extend(data.loc[data["id"] == i].index.tolist())
data.drop(datadel, inplace=True)
print(data)

Add column to DataFrame in a loop

Let's say I have a very simple pandas dataframe containing a single indexed column with "initial values". I want to read N other dataframes in a loop to fill a single "comparison" column, matching on indices.
For instance, with my initial dataframe as
Initial
0 a
1 b
2 c
3 d
and the following two dataframes to read in a loop
Comparison
0 e
1 f
Comparison
2 g
3 h
4 i <= note that this index doesn't exist in Initial so won't be matched
I would like to produce the following result
Initial Comparison
0 a e
1 b f
2 c g
3 d h
Using merge, concat or join, I only ever seem to be able to create a new column for each iteration of the loop, filling the blanks with NaN.
What's the most pandas-pythonic way of achieving this?
Below is an example from the proposed duplicate solution:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.array([['a'],['b'],['c'],['d']]), columns=['Initial'])
print(df1)
df2 = pd.DataFrame(np.array([['e'],['f']]), columns=['Compare'])
print(df2)
df3 = pd.DataFrame(np.array([[2,'g'],[3,'h'],[4,'i']]), columns=['','Compare'])
df3 = df3.set_index('')
print(df3)
print(df1.merge(df2, left_index=True, right_index=True).merge(df3, left_index=True, right_index=True))
>>
Initial
0 a
1 b
2 c
3 d
Compare
0 e
1 f
Compare
2 g
3 h
4 i
Empty DataFrame
Columns: [Initial, Compare_x, Compare_y]
Index: []
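A hedged diagnostic sketch: the merge comes back empty because the mixed-type np.array forces every value in df3 to be a string, including the index, so nothing aligns with df1's integer index:
# The mixed-type array made the index values strings, not integers
print(df3.index.tolist())  # ['2', '3', '4'] - no match with 0..3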
Second edit: @W-B, the following seems to work, but it can't be the case that there isn't a simpler option using proper pandas methods. It also requires turning off warnings, which might be dangerous...
pd.options.mode.chained_assignment = None
df1["Compare"] = pd.Series()
for ind in df1.index.values:
    if ind in df2.index.values:
        df1["Compare"][ind] = df2.T[ind]["Compare"]
    if ind in df3.index.values:
        df1["Compare"][ind] = df3.T[ind]["Compare"]
print(df1)
>>
Initial Compare
0 a e
1 b f
2 c g
3 d h
OK, since OP needs more info.
Data input
import functools
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.array([['a'],['b'],['c'],['d']]), columns=['Initial'])
df1['Compare']=np.nan
df2 = pd.DataFrame(np.array([['e'],['f']]), columns=['Compare'])
df3 = pd.DataFrame(np.array(['g','h','i']), columns=['Compare'],index=[2,3,4])
Solution
newdf=functools.reduce(lambda x,y: x.fillna(y),[df1,df2,df3])
newdf
Out[639]:
Initial Compare
0 a e
1 b f
2 c g
3 d h
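If the N comparison frames really are read one at a time in a loop, as the question describes, here is a hedged sketch of the same fillna idea without functools (assuming each frame has a Compare column aligned by index):
# Fill the single Compare column frame by frame; Series.fillna aligns
# on the index, so the unmatched index 4 in df3 is simply ignored.
df1['Compare'] = np.nan
for other in [df2, df3]:  # stand-ins for the N frames read in the loop
    df1['Compare'] = df1['Compare'].fillna(other['Compare'])
print(df1)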

How do I specify a column header for pandas groupby result?

I need to group by and then return the values of a column in concatenated form. While I have managed to do this, the returned dataframe has a column named 0. Just 0. Is there a way to specify what the column name will be?
all_columns_grouped = all_columns.groupby(['INDEX','URL'], as_index = False)['VALUE'].apply(lambda x: ' '.join(x)).reset_index()
The resulting groupby object has the headers
INDEX | URL | 0
The results are in the 0 column.
While I have managed to rename the column using
.rename(index=str, columns={0: "variant"}), this seems very inelegant.
Is there any way to provide a header for the column? Thanks
The simplest approach is to remove as_index=False so a Series is returned, and pass the name parameter to reset_index:
Sample:
all_columns = pd.DataFrame({'VALUE':['a','s','d','ss','t','y'],
                            'URL':[5,5,4,4,4,4],
                            'INDEX':list('aaabbb')})
print (all_columns)
INDEX URL VALUE
0 a 5 a
1 a 5 s
2 a 4 d
3 b 4 ss
4 b 4 t
5 b 4 y
all_columns_grouped = all_columns.groupby(['INDEX','URL'])['VALUE'] \
.apply(' '.join) \
.reset_index(name='variant')
print (all_columns_grouped)
INDEX URL variant
0 a 4 d
1 a 5 a s
2 b 4 ss t y
You can use agg when applied to a column (VALUE in this case) to assign column names to the result of a function.
# Sample data (thanks @jezrael)
all_columns = pd.DataFrame({'VALUE':['a','s','d','ss','t','y'],
                            'URL':[5,5,4,4,4,4],
                            'INDEX':list('aaabbb')})
# Solution
>>> all_columns.groupby(['INDEX','URL'], as_index=False)['VALUE'].agg(
{'variant': lambda x: ' '.join(x)})
INDEX URL variant
0 a 4 d
1 a 5 a s
2 b 4 ss t y
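Note that this dict-of-renamers form of agg was removed in pandas 1.0, where it raises a SpecificationError; a minimal sketch of the modern equivalent uses named aggregation (pandas >= 0.25):
# Named aggregation replaces the deprecated dict-renaming form
out = all_columns.groupby(['INDEX', 'URL'], as_index=False)['VALUE'].agg(
    variant=' '.join)
print(out)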
