pandas groupby apply list from column based on binary column - python

I have a dataframe:
id  to  from  flag
1   a   x     1
1   a   y     0
2   c   z     1
2   c   m     1
2   b   v     0
2   b   p     0
and I want to groupby(['id', 'to']) and return a list of the elements in from that have flag 1. If no element in the group has flag 1, the result should be 'None'. The desired output is:
id  to  from
1   a   ['x']
2   c   ['z', 'm']
2   b   None
I can do it with apply, i.e.
out_df = df.groupby(['id', 'to']).apply(
    lambda x: match_to_list(x['from'], x['flag'])).reset_index()
where:
def match_to_list(to, flag):
    matches = list(to.iloc[flag.to_numpy().nonzero()[0]])
    if len(matches) == 0:
        return 'None'
    else:
        return matches
but this is taking too long, and I think there must be a better way that I am missing.
Any help/insights would be much appreciated! TIA

IIUC, first build a MultiIndex from the unique (id, to) pairs, then groupby with agg and reindex:
idx = pd.MultiIndex.from_tuples(
    list(map(tuple, df[['id', 'to']].drop_duplicates().values.tolist())))
yourdf = (df.loc[df.flag == 1]
            .groupby(['id', 'to'])['from']
            .agg(list)
            .reindex(idx)
            .reset_index())
yourdf
Out[13]:
level_0 level_1 from
0 1 a [x]
1 2 c [z, m]
2 2 b NaN
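If the question's literal 'None' is wanted instead of the NaN that reindex introduces, a follow-up fill works; a minimal sketch (assuming yourdf from above):
yourdf['from'] = yourdf['from'].where(yourdf['from'].notna(), 'None')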
Or just use apply, which is less efficient but more readable:
df.groupby(['id', 'to']).apply(
    lambda x: x['from'][x['flag'] == 1].tolist() if (x['flag'] == 1).any() else None
).reset_index()
Out[17]:
id to 0
0 1 a [x]
1 2 b None
2 2 c [z, m]
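Note that the apply version leaves the result in a column named 0; a rename at the end of the chain fixes that. A minimal sketch (assuming the df above):
out = df.groupby(['id', 'to']).apply(
    lambda x: x['from'][x['flag'] == 1].tolist() if (x['flag'] == 1).any() else None
).reset_index().rename(columns={0: 'from'})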

Related

Pandas: how to do value counts within groups

I have the following dataframe. I want to group by a and b first. Within each group, I need to do a value count based on c and pick only the value with the most counts. If more than one c value ties for the most counts in a group, any one of them will do.
a b c
1 1 x
1 1 y
1 1 y
1 2 y
1 2 y
1 2 z
2 1 z
2 1 z
2 1 a
2 1 a
The expected result would be
a b c
1 1 y
1 2 y
2 1 z
What is the right way to do it? It would be even better if I could print out each group with c's value counts sorted, as an intermediate step.
You are looking for .value_counts():
df.groupby(['a', 'b'])['c'].value_counts()
a  b  c
1  1  y    2
      x    1
   2  y    2
      z    1
2  1  a    2
      z    2
Name: c, dtype: int64
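To actually pick the most frequent c per group from here, take idxmax of the per-group value counts; a minimal sketch (assuming the df above; ties resolve to whichever value value_counts lists first, which satisfies the "pick any one" requirement):
res = (df.groupby(['a', 'b'])['c']
         .agg(lambda s: s.value_counts().idxmax())
         .reset_index())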
Grouping the original dataframe by ['a', 'b'] and taking .max() was also suggested, but beware: it returns the lexicographically largest value, not the most frequent one (for group (1, 2) it yields z rather than the expected y):
df.groupby(['a', 'b'])['c'].max()
You can also aggregate 'count' and 'max' values (dict aggregation on a single column was removed in pandas 1.0, so use named aggregation):
df.groupby(['a', 'b'])['c'].agg(max='max', count='count').reset_index()
Try:
df=df.groupby(["a", "b", "c"])["c"].count().sort_values(ascending=False).reset_index(name="dropme").drop_duplicates(subset=["a", "b"], keep="first").drop("dropme", axis=1)
Outputs:
a b c
0 2 1 z
2 1 2 y
3 1 1 y

Count the frequency of list element in a row grouped by Date and tag

I have a dataframe df which looks like this:
ID Date Input
1 1-Nov A,B
1 2-NOV A
2 3-NOV A,B,C
2 4-NOV B,D
I want my output to count the occurrence of each input across consecutive dates, resetting to zero when the run breaks (counting only within the same ID). The output columns should be named X.A, X.B, X.C and X.D, so my output will look like this:
ID Date Input X.A X.B X.C X.D
1 1-NOV A,B 1 1 0 0
1 2-NOV A 2 0 0 0
2 3-NOV A,B,C 1 1 1 0
2 4-NOV B,D 0 2 0 1
How can I create the output columns (A, B, C and D) that count the input occurrences date- and ID-wise?
Use Series.str.get_dummies for indicator columns, then count consecutive 1s per group: take GroupBy.cumsum and subtract its forward-filled value as of the last miss (GroupBy.ffill on the masked cumsum), rename the columns with DataFrame.add_prefix, and finally DataFrame.join back to the original:
# boolean indicator column per input value
a = df['Input'].str.get_dummies(',') == 1
# running count of occurrences per ID
b = a.groupby(df.ID).cumsum().astype(int)
# subtract the running count as of the last miss so each streak restarts at 1
df1 = (b - b.mask(a).groupby(df.ID).ffill().fillna(0).astype(int)).add_prefix('X.')
df = df.join(df1)
print(df)
ID Date Input X.A X.B X.C X.D
0 1 1-Nov A,B 1 1 0 0
1 1 2-NOV A 2 0 0 0
2 2 3-NOV A,B,C 1 1 1 0
3 2 4-NOV B,D 0 2 0 1
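For reference, str.get_dummies(',') splits each Input on the comma and builds one indicator column per distinct value; a quick check on the sample frame above:
print(df['Input'].str.get_dummies(','))
   A  B  C  D
0  1  1  0  0
1  1  0  0  0
2  1  1  1  0
3  0  1  0  1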
First add the new indicator columns, then use groupby to take a cumulative sum:
import numpy as np

# find which columns to add
cols = set(l for sublist in df['Input'].str.split(',') for l in sublist)

# add the new indicator columns
for col in cols:
    df['X.' + col] = df['Input'].apply(lambda x: int(col in x))

# group by ID and take a cumulative sum, zeroed where the value is absent
# (note: this counts all prior occurrences within the ID rather than truly
# resetting after a gap; it matches this example's output)
group = df.groupby('ID')
for col in cols:
    df['X.' + col] = group['X.' + col].apply(lambda x: np.cumsum(x) * (x > 0).astype(int))
The result is then:
print(df)
ID Date Input X.C X.D X.A X.B
0 1 1-NOV A,B 0 0 1 1
1 1 2-NOV A 0 0 2 0
2 2 3-NOV A,B,C 1 0 1 1
3 2 4-NOV B,D 0 1 0 2

Python: how to drop duplicates with duplicates?

I have a dataframe like the following
df
Name Y
0 A 1
1 A 0
2 B 0
3 B 0
5 C 1
I want to drop the duplicates of Name and keep the rows that have Y=1, so that:
df
Name Y
0 A 1
1 B 0
2 C 1
Use the drop_duplicates method; sorting by Y descending first guarantees that the Y=1 row is the one kept for each Name:
df.sort_values('Y', ascending=False).drop_duplicates(subset=['Name'])
groupby + max
Assuming your Y series consists only of 0 and 1 values:
res = df.groupby('Name', as_index=False)['Y'].max()
print(res)
Name Y
0 A 1
1 B 0
2 C 1
Does the 'Y' column contain only 0s and 1s? In that case, you can try the following:
df = df.sort_values(['Y'], ascending=False)
df = df.drop_duplicates(['Name'])
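If the frame had more columns to preserve, selecting whole rows by the index of each group's maximum also works; a minimal sketch (again assuming Y is 0/1):
res = df.loc[df.groupby('Name')['Y'].idxmax()].reset_index(drop=True)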

Get substring in one column based on the value in another column

My dataframe looks like this:
test_df = pd.DataFrame({'name':['a12','b1','c'],'Length':[2,1,0]})
test_df
Length name
0 2 a12
1 1 b1
2 0 c
I would like to have a result like this:
Length name
0 2 a
1 1 b
2 0 c
With this code (from Getting substring based on another column in a pandas dataframe):
test_df.apply(lambda x: x['name'][:-x['Length']],axis = 1)
test_df
I got the same dataframe as before:
Length name
0 2 a12
1 1 b1
2 0 c
Modify your apply a bit to slice with respect to len(x['name']) instead of a negative offset (a negative slice like [:-0] is really [:0], which empties the string wherever Length is 0); also note that apply returns a new Series, so assign it back if you want to change the frame:
def f(x):
    return x['name'][:len(x['name']) - x['Length']]

test_df.apply(f, axis=1)
0 a
1 b
2 c
dtype: object
Try this:
import pandas as pd
test_df = pd.DataFrame({'name':['a12','b1','c'],'Length':[2,1,0]})
test_df['name'] = test_df.apply(lambda x: x['name'][:len(x['name']) - x['Length']], axis=1)
test_df
This outputs what you intended:
Length name
0 2 a
1 1 b
2 0 c
One can use list functions for this:
outlist = list(map(lambda x, y: x[0:(len(x) - y)], test_df.name, test_df.Length))
test_df['name'] = outlist
print(test_df)
Output:
Length name
0 2 a
1 1 b
2 0 c

Finding duplicate rows in a Pandas Dataframe then Adding a column in the Dataframe that states if the row is a duplicate

I have a pandas dataframe that contains a column with possible duplicates. I would like to create a column that will hold a 1 if the row is a duplicate and 0 if it is not.
So if I have:
A|B
1 1|x
2 2|y
3 1|x
4 3|z
I would get:
A|B|C
1 1|x|1
2 2|y|0
3 1|x|1
4 3|z|0
I tried df['C'] = np.where(df['A'] == df['A'], '1', '0'), but this just created a column of all 1's in C (the comparison df['A'] == df['A'] is True for every row).
You need Series.duplicated with keep=False to flag all duplicates, then cast the boolean mask (Trues and Falses) to 1s and 0s with astype(int) and, if necessary, to str:
df['C'] = df['A'].duplicated(keep=False).astype(int).astype(str)
print (df)
A B C
1 1 x 1
2 2 y 0
3 1 x 1
4 3 z 0
If you need to check duplicates in columns A and B together, use DataFrame.duplicated:
df['C'] = df.duplicated(subset=['A','B'], keep=False).astype(int).astype(str)
print (df)
A B C
1 1 x 1
2 2 y 0
3 1 x 1
4 3 z 0
And a numpy.where solution:
df['C'] = np.where(df['A'].duplicated(keep=False), '1', '0')
print (df)
A B C
1 1 x 1
2 2 y 0
3 1 x 1
4 3 z 0
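A groupby-based alternative counts group sizes instead of calling duplicated; a sketch (assuming duplicates are judged on A and B together):
df['C'] = (df.groupby(['A', 'B'])['A'].transform('size') > 1).astype(int).astype(str)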
