I have the following dataframe. I want to group by a and b first. Within each group, I need to do a value count based on c and pick only the value with the most counts. If more than one c value ties for the most counts within a group, just pick any one.
a b c
1 1 x
1 1 y
1 1 y
1 2 y
1 2 y
1 2 z
2 1 z
2 1 z
2 1 a
2 1 a
The expected result would be
a b c
1 1 y
1 2 y
2 1 z
What is the right way to do it? It would be even better if I could also print out each group with c's value counts sorted, as an intermediate step.
You are looking for .value_counts():
df.groupby(['a', 'b'])['c'].value_counts()
a  b  c
1  1  y    2
      x    1
   2  y    2
      z    1
2  1  a    2
      z    2
Name: c, dtype: int64
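To get from those counts to the expected result, one way (a minimal sketch; for ties, value_counts simply lists one of the tied values first, which satisfies "pick any one") is to take the first index of each group's counts:
df.groupby(['a', 'b'])['c'].agg(lambda x: x.value_counts().index[0]).reset_index()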
Grouping the original dataframe by ['a', 'b'] and taking the .max() is sometimes suggested, but it returns the largest value, not the most frequent one; here it would give 'z' instead of 'y' for group (1, 2):
df.groupby(['a', 'b'])['c'].max()
You can also aggregate 'count' and 'max' values (using named aggregation, since the old dict form is deprecated):
df.groupby(['a', 'b'])['c'].agg(max='max', count='count').reset_index()
Try:
df = (df.groupby(["a", "b", "c"])["c"].count()
        .sort_values(ascending=False)
        .reset_index(name="dropme")
        .drop_duplicates(subset=["a", "b"], keep="first")
        .drop("dropme", axis=1))
Outputs:
a b c
0 2 1 z
2 1 2 y
3 1 1 y
Let's say I have a df like this. I need to group by links, and if a link is repeated more than 3 times, its page suffix should be incremented:
name links
A https://a.com/-pg0
B https://b.com/-pg0
C https://c.com/-pg0
D https://c.com/-pg0
x https://c.com/-pg0
y https://c.com/-pg0
z https://c.com/-pg0
E https://e.com/-pg0
F https://e.com/-pg0
Expected output: here the link shared by names C, D, x, y, z is repeated more than 3 times, so the first 3 occurrences keep -pg0 and the following ones are incremented:
name links
A https://a.com/-pg0
B https://b.com/-pg0
C https://c.com/-pg0
D https://c.com/-pg0
x https://c.com/-pg0
y https://c.com/-pg1
z https://c.com/-pg1
E https://e.com/-pg0
F https://e.com/-pg0
You can try cumcount with floor division (//):
s = df.groupby('links').cumcount()//3
Out[125]:
0 0
1 0
2 0
3 0
4 0
5 1
6 1
7 0
8 0
dtype: int64
df['links'] = df['links'].str[:-1] + s.astype(str)  # drop the old trailing digit before appending the new one
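Or do the whole thing in one step (same idea; this assumes every link ends in exactly one page digit to strip):
df['links'] = df['links'].str[:-1] + (df.groupby('links').cumcount() // 3).astype(str)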
I have a dataframe:
id to from flag
1 a x 1
1 a y 0
2 c z 1
2 c m 1
2 b v 0
2 b p 0
and I want to groupby(['id', 'to']) and return a list of the elements in from that have a flag 1 only. If no element has a flag 1, then the resulting output should be 'None'. The desired output should be:
id to from
1 a ['x']
2 c ['z','m']
2 b None
I can do it with apply, i.e.:
out_df = df.groupby(['id', 'to']).apply(
    lambda x: match_to_list(x['from'], x['flag'])).reset_index()
where:
def match_to_list(to, flag):
    matches = list(to.iloc[flag.to_numpy().nonzero()[0]])
    if len(matches) == 0:
        return 'None'
    else:
        return matches
but this is taking too long and I think there must be a better way that I am missing.
Any help/insights would be very appreciated! TIA
IIUC, first create the index with MultiIndex, then do groupby with agg:
idx = pd.MultiIndex.from_tuples(list(map(tuple, df[['id', 'to']].drop_duplicates().values.tolist())), names=['id', 'to'])
yourdf=df.loc[df.flag==1].groupby(['id','to'])['from'].agg(list).reindex(idx).reset_index()
yourdf
Out[13]:
id to from
0 1 a [x]
1 2 c [z, m]
2 2 b NaN
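If you need None (or the string 'None') rather than NaN in the missing slots, a simple post-processing sketch:
yourdf['from'] = [v if isinstance(v, list) else None for v in yourdf['from']]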
Or just use apply, which is less efficient but more readable:
df.groupby(['id','to']).apply(lambda x: x['from'][x['flag']==1].tolist() if (x['flag']==1).any() else None).reset_index()
Out[17]:
id to 0
0 1 a [x]
1 2 b None
2 2 c [z, m]
I have a pandas dataframe that contains a column with possible duplicates. I would like to create a column that will contain a 1 if the row is a duplicate and 0 if it is not.
So if I have:
A|B
1 1|x
2 2|y
3 1|x
4 3|z
I would get:
A|B|C
1 1|x|1
2 2|y|0
3 1|x|1
4 3|z|0
I tried df['C'] = np.where(df['A']==df['A'], '1', '0') but this just created a column of all 1's in C.
You need Series.duplicated with parameter keep=False to mark all duplicates, then cast the boolean mask (Trues and Falses) to 1s and 0s with astype(int), and if necessary cast to str:
df['C'] = df['A'].duplicated(keep=False).astype(int).astype(str)
print (df)
A B C
1 1 x 1
2 2 y 0
3 1 x 1
4 3 z 0
If you need to check duplicates in columns A and B together, use DataFrame.duplicated:
df['C'] = df.duplicated(subset=['A','B'], keep=False).astype(int).astype(str)
print (df)
A B C
1 1 x 1
2 2 y 0
3 1 x 1
4 3 z 0
And a numpy.where solution:
df['C'] = np.where(df['A'].duplicated(keep=False), '1', '0')
print (df)
A B C
1 1 x 1
2 2 y 0
3 1 x 1
4 3 z 0
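If plain integers are fine, as in the example output, you can drop the final string cast:
df['C'] = df.duplicated(subset=['A','B'], keep=False).astype(int)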
I have a dataframe (df) as such:
A B
1 a
2 b
3 c
And a series: S = pd.Series(['x','y','z']). I want to repeat the dataframe df for each value in the series. The expected result should look like this:
result:
S A B
x 1 a
y 1 a
z 1 a
x 2 b
y 2 b
z 2 b
x 3 c
y 3 c
z 3 c
How do I achieve this kind of output? I'm thinking of merge or join, but merging is giving me a memory error. I am dealing with a rather large dataframe and series. Thanks!
Using numpy, let's say you have a series and df of different lengths:
s = pd.Series(['X', 'Y', 'Z', 'A'])  # added a value to s to make it length 4
s_n = len(s)
df_n = len(df)
pd.DataFrame(np.repeat(df.values, s_n, axis=0), columns=df.columns,
             index=np.tile(s, df_n)).rename_axis('S').reset_index()
S A B
0 X 1 a
1 Y 1 a
2 Z 1 a
3 A 1 a
4 X 2 b
5 Y 2 b
6 Z 2 b
7 A 2 b
8 X 3 c
9 Y 3 c
10 Z 3 c
11 A 3 c
UPDATE:
Here is a slightly changed version of @A-Za-z's solution, which might save a bit of memory but is slower:
x = pd.DataFrame(index=range(len(df) * len(S)))
for col in df.columns:
    x[col] = np.repeat(df[col].to_numpy(), len(S))
x['S'] = np.tile(S, len(df))
Old incorrect answer:
In [94]: pd.concat([df.assign(S=S)] * len(S))
Out[94]:
A B S
0 1 a x
1 2 b y
2 3 c z
0 1 a x
1 2 b y
2 3 c z
0 1 a x
1 2 b y
2 3 c z
Setup
df = pd.DataFrame({'A': {0: 1, 1: 2, 2: 3}, 'B': {0: 'a', 1: 'b', 2: 'c'}})
S = pd.Series(['x','y','z'], name='S')
Solution
# Convert the Series to a DataFrame with the desired output shape, filled with S values.
# Then join df_S to df to get the As and Bs.
df_S = pd.DataFrame(index=np.repeat(df.index, len(S)), columns=['S'], data=np.tile(S.values, len(df)))
df_S.join(df)
Out[54]:
S A B
0 x 1 a
0 y 1 a
0 z 1 a
1 x 2 b
1 y 2 b
1 z 2 b
2 x 3 c
2 y 3 c
2 z 3 c
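On newer pandas (a sketch assuming pandas >= 1.2, where merge accepts how='cross'), a cross merge with the named Series from the setup gives the same result; whether it avoids the memory error depends on the sizes involved:
df.merge(S, how='cross')[['S', 'A', 'B']]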
Suppose I create a pandas DataFrame with two columns, one of which contains some numbers and the other contains letters. Like this:
import pandas as pd
from pprint import pprint
df = pd.DataFrame({'a': [1,2,3,4,5,6], 'b': ['y','x','y','x','y', 'y']})
pprint(df)
a b
0 1 y
1 2 x
2 3 y
3 4 x
4 5 y
5 6 y
Now say that I want to make a third column (c) whose value is equal to the last value of a when b was equal to x. In the cases where a value of x was not encountered in b yet, the value in c should default to 0.
The procedure should produce pretty much the following result:
last_a = 0
c = []
for i,b in enumerate(df['b']):
if b == 'x':
last_a = df.iloc[i]['a']
c += [last_a]
df['c'] = c
pprint(df)
a b c
0 1 y 0
1 2 x 2
2 3 y 2
3 4 x 4
4 5 y 4
5 6 y 4
Is there a more elegant way to accomplish this either with or without pandas?
In [140]: df = pd.DataFrame({'a': [1,2,3,4,5,6], 'b': ['y','x','y','x','y', 'y']})
In [141]: df
Out[141]:
a b
0 1 y
1 2 x
2 3 y
3 4 x
4 5 y
5 6 y
Find out where column 'b' == 'x', then take the value of column 'a' at those positions; everything else becomes NaN:
In [142]: df['c'] = df.loc[df['b'] == 'x', 'a']
Fill the rest of the values forward, then fill the holes with 0 (casting to int to match the output below):
In [143]: df['c'] = df['c'].ffill().fillna(0).astype(int)
In [144]: df
Out[144]:
a b c
0 1 y 0
1 2 x 2
2 3 y 2
3 4 x 4
4 5 y 4
5 6 y 4
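An equivalent one-liner (a sketch using Series.where, which NaNs out the rows where b != 'x', then applies the same ffill/fillna idea):
df['c'] = df['a'].where(df['b'] == 'x').ffill().fillna(0).astype(int)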