For example:
A B C
1 1 2
2 1 2
3 3 3
3 2 1
I want to add a column D that gives, for each row, the number of matching values across A, B and C (i.e. the count of the row's most common value).
D
2
2
3
1
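For reference, the example frame can be constructed as:

```python
import pandas as pd

# reconstruction of the example frame from the question
df = pd.DataFrame({'A': [1, 2, 3, 3], 'B': [1, 1, 3, 2], 'C': [2, 2, 3, 1]})
print(df)
```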
Option 1
You can use stack + groupby + value_counts:
df['D'] = df.stack().groupby(level=0).value_counts().max(level=0)
df
A B C D
0 1 1 2 2
1 2 1 2 2
2 3 3 3 3
3 3 2 1 1
If you also want the value that has the highest count, chain a groupby + head call:
v = (df.stack()
       .groupby(level=0)
       .value_counts()
       .groupby(level=0)
       .head(1)
       .reset_index(level=0, drop=True)
     )
1 2
2 2
3 3
1 1
dtype: int64
df['Num'], df['Num_Mode'] = v.index, v.values # to assign it
If multiple values share the same highest count, only one of them is returned.
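Note that `max(level=0)` above relies on the `level` argument of `Series.max`, which was deprecated in pandas 1.3 and removed in 2.0; an equivalent spelling for newer versions (a sketch):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 3], 'B': [1, 1, 3, 2], 'C': [2, 2, 3, 1]})

# same idea: count values per row, then keep the largest count in each row group
df['D'] = df.stack().groupby(level=0).value_counts().groupby(level=0).max()
```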
Option 2
Another option, inspired by @Wen, using apply with pd.Series.mode:
df['D'] = df.stack().groupby(level=0).apply(lambda x: pd.Series.mode(x).max())
Or,
df['D'] = df.apply(pd.Series.mode, 1).max(1).astype(int)
df
A B C D
0 1 1 2 2
1 2 1 2 2
2 3 3 3 3
3 3 2 1 1
scipy's stats.mode can return the count as well:
from scipy import stats
df['D'] = stats.mode(df.values, 1)[1]
df
Out[829]:
A B C D
0 1 1 2 2
1 2 1 2 2
2 3 3 3 3
3 3 2 1 1
More Info:
stats.mode(df.values,1)
Out[830]:
ModeResult(mode=array([[1],
       [2],
       [3],
       [1]], dtype=int64), count=array([[2],
       [2],
       [3],
       [1]]))
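If scipy is not available, the modal value and its count can be recovered in pure pandas as well (a sketch):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 3], 'B': [1, 1, 3, 2], 'C': [2, 2, 3, 1]})

counts = df.stack().groupby(level=0).value_counts()   # (row, value) -> count
df['D'] = counts.groupby(level=0).max()               # count of the modal value
df['Mode'] = df[['A', 'B', 'C']].mode(axis=1)[0].astype(int)  # smallest mode, like scipy
```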
Related
I cannot solve a very easy/simple problem in pandas. :(
I have the following table:
df = pd.DataFrame(data=dict(a=[1, 1, 1,2, 2, 3,1], b=["A", "A","B","A", "B", "A","A"]))
df
Out[96]:
a b
0 1 A
1 1 A
2 1 B
3 2 A
4 2 B
5 3 A
6 1 A
I would like to make an incrementing ID for each unique item grouped by columns a and b. So the result would look like this (column c):
Out[98]:
a b c
0 1 A 1
1 1 A 1
2 1 B 2
3 2 A 3
4 2 B 4
5 3 A 5
6 1 A 1
I tried with:
df.groupby(["a", "b"]).nunique().cumsum().reset_index()
Result:
Out[105]:
a b c
0 1 A 1
1 1 B 2
2 2 A 3
3 2 B 4
4 3 A 5
Unfortunately this works only on the grouped dataset, not on the original one. As you can see, the original table has 7 rows while the groupby returns only 5.
So could someone please help me on how to get the desired table:
a b c
0 1 A 1
1 1 A 1
2 1 B 2
3 2 A 3
4 2 B 4
5 3 A 5
6 1 A 1
Thank you in advance!
groupby + ngroup
df['c'] = df.groupby(['a', 'b']).ngroup() + 1
a b c
0 1 A 1
1 1 A 1
2 1 B 2
3 2 A 3
4 2 B 4
5 3 A 5
6 1 A 1
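`ngroup` numbers the groups in sorted key order by default (which happens to match appearance order here); pass `sort=False` if you want the IDs strictly in order of first appearance (a sketch):

```python
import pandas as pd

df = pd.DataFrame(dict(a=[1, 1, 1, 2, 2, 3, 1], b=["A", "A", "B", "A", "B", "A", "A"]))

# number groups by first appearance rather than sorted key order
df['c'] = df.groupby(['a', 'b'], sort=False).ngroup() + 1
```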
Use pd.factorize after creating tuples from the (a, b) columns:
df['c'] = pd.factorize(df[['a', 'b']].apply(tuple, axis=1))[0] + 1
print(df)
# Output
a b c
0 1 A 1
1 1 A 1
2 1 B 2
3 2 A 3
4 2 B 4
5 3 A 5
6 1 A 1
I have the following dataframe (from this question) where one of the columns is an object (list-type cell):
I don't want to use explode (I'm on an older version of pandas). How do I do the same for a dataframe with three columns?
df
A B C
0 1 [1, 2] 3
1 1 [1, 2] 4
2 2 [3, 4] 5
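For reference, the frame above (with list cells in B) can be reconstructed as:

```python
import pandas as pd

# B holds list cells, as in the question
df = pd.DataFrame({'A': [1, 1, 2], 'B': [[1, 2], [1, 2], [3, 4]], 'C': [3, 4, 5]})
```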
My expected output is:
A B C
0 1 1 3
1 1 2 3
2 1 1 4
3 1 2 4
4 2 3 5
5 2 4 5
I found these two methods useful.
How can I add the third column to this code?
df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0:'B'})
or
df=pd.DataFrame({'A':df.A.repeat(df.B.str.len()),'B':np.concatenate(df.B.values)})
You set the index to be all of the columns you want to keep tied to the list you explode:
(df.set_index(['A', 'C'])['B']
   .apply(pd.Series).stack()
   .reset_index()
   .drop(columns='level_2').rename(columns={0: 'B'}))
A C B
0 1 3 1
1 1 3 2
2 1 4 1
3 1 4 2
4 2 5 3
5 2 5 4
Or, for the second method, also repeat 'C':
pd.DataFrame({'A': df.A.repeat(df.B.str.len()),
              'C': df.C.repeat(df.B.str.len()),
              'B': np.concatenate(df.B.to_numpy())})
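A more generic variant of the repeat approach, which repeats every scalar column at once instead of naming each one (a sketch, assuming B is the only list column):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2], 'B': [[1, 2], [1, 2], [3, 4]], 'C': [3, 4, 5]})

lens = df['B'].str.len()                                # length of each list cell
out = (df.drop(columns='B')
         .loc[df.index.repeat(lens)]                    # repeat the scalar columns
         .assign(B=np.concatenate(df['B'].to_numpy()))  # flatten the lists
         .reset_index(drop=True)
         [['A', 'B', 'C']])
```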
You can use itertools to reshape your data :
from itertools import product,chain
pd.DataFrame(chain.from_iterable(product([a], b, [c])
                                 for a, b, c in df.to_numpy()),
             columns=df.columns)
A B C
0 1 1 3
1 1 2 3
2 1 1 4
3 1 2 4
4 2 3 5
5 2 4 5
df I have:
A B C
a 1 2 3
b 2 1 4
c 1 1 1
df I want:
A B C
a 1 2 3
b 2 1 4
c 1 1 1
d 1 -1 1
I am able to get df want by using:
df.loc['d']=df.loc['b']-df.loc['a']
However, my actual df has 'a','b','c' rows for multiple IDs 'X', 'Y' etc.
A B C
X a 1 2 3
b 2 1 4
c 1 1 1
Y a 1 2 3
b 2 2 4
c 1 1 1
How can I create the same output with multiple IDs?
My original method, df.loc['d'] = df.loc['b'] - df.loc['a'], fails with KeyError: 'b'.
Desired output:
A B C
X a 1 2 3
b 2 1 4
c 1 1 1
d 1 -1 1
Y a 1 2 3
b 2 2 4
c 1 1 1
d 1 0 1
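For reference, the two-level frame can be built like this (assuming Y's b row is 2 2 4, as in the desired output):

```python
import pandas as pd

idx = pd.MultiIndex.from_product([['X', 'Y'], ['a', 'b', 'c']])
df = pd.DataFrame([[1, 2, 3], [2, 1, 4], [1, 1, 1],
                   [1, 2, 3], [2, 2, 4], [1, 1, 1]],
                  index=idx, columns=['A', 'B', 'C'])
```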
IIUC,
for i, sub in df.groupby(df.index.get_level_values(0)):
    df.loc[(i, 'd'), :] = sub.loc[(i, 'b')] - sub.loc[(i, 'a')]

print(df.sort_index())
Or maybe
k = df.groupby(df.index.get_level_values(0), as_index=False).apply(
        lambda s: pd.DataFrame([s.loc[(s.name, 'b')].values - s.loc[(s.name, 'a')].values],
                               columns=s.columns,
                               index=pd.MultiIndex(levels=[[s.name], ['d']], codes=[[0], [0]]))
    ).reset_index(drop=True, level=0)
pd.concat([k, df]).sort_index()
Data reshaping is a useful trick if you want to perform manipulations on a particular level of a MultiIndex. See the code below:
result = (df.unstack(0).T
            .assign(d=lambda x: x.b - x.a)
            .stack()
            .unstack(0))
Use pd.IndexSlice to slice a and b. Call diff, slice on b, and rename it to d. Finally, append it to the original df:
idx = pd.IndexSlice
df1 = df.loc[idx[:,['a','b']],:].diff().loc[idx[:,'b'],:].rename({'b': 'd'})
df2 = df.append(df1).sort_index().astype(int)
Out[106]:
A B C
X a 1 2 3
b 2 1 4
c 1 1 1
d 1 -1 1
Y a 1 2 3
b 2 2 4
c 1 1 1
d 1 0 1
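An equivalent vectorized sketch using `xs` to pull the a and b rows directly (assumes the two-level index from the question):

```python
import pandas as pd

idx = pd.MultiIndex.from_product([['X', 'Y'], ['a', 'b', 'c']])
df = pd.DataFrame([[1, 2, 3], [2, 1, 4], [1, 1, 1],
                   [1, 2, 3], [2, 2, 4], [1, 1, 1]],
                  index=idx, columns=['A', 'B', 'C'])

# b - a per outer ID, relabelled as 'd' and appended
d = df.xs('b', level=1) - df.xs('a', level=1)
d.index = pd.MultiIndex.from_arrays([d.index, ['d'] * len(d)])
out = pd.concat([df, d]).sort_index()
```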
I group the following pandas dataframe by 'name' and then apply several lambda functions on 'value' to generate additional columns.
Is it possible to apply these lambda functions at once, to increase efficiency?
import pandas as pd
df = pd.DataFrame({'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
                   'value': [1, 3, 1, 2, 3, 1, 2, 3, 3]})
df['Diff'] = df.groupby('name')['value'].transform(lambda x: x - x.iloc[0])
df['Count'] = df.groupby('name')['value'].transform(lambda x: x.count())
df['Index'] = df.groupby('name')['value'].transform(lambda x: x.index - x.index[0] + 1)
print(df)
Output:
name value Diff Count Index
0 A 1 0 2 1
1 A 3 2 2 2
2 B 1 0 4 1
3 B 2 1 4 2
4 B 3 2 4 3
5 B 1 0 4 4
6 C 2 0 3 1
7 C 3 1 3 2
8 C 3 1 3 3
It is possible to use GroupBy.apply with one function, though I'm not sure it performs better:
def f(x):
    a = x - x.iloc[0]
    b = x.count()
    c = x.index - x.index[0] + 1
    return pd.DataFrame({'Diff': a, 'Count': b, 'Index': c})
df = df.join(df.groupby('name')['value'].apply(f))
print(df)
name value Diff Count Index
0 A 1 0 2 1
1 A 3 2 2 2
2 B 1 0 4 1
3 B 2 1 4 2
4 B 3 2 4 3
5 B 1 0 4 4
6 C 2 0 3 1
7 C 3 1 3 2
8 C 3 1 3 3
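The three lambdas can also be replaced with built-in groupby operations, which avoids Python-level function calls entirely (a sketch):

```python
import pandas as pd

df = pd.DataFrame({'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
                   'value': [1, 3, 1, 2, 3, 1, 2, 3, 3]})

g = df.groupby('name')['value']
df['Diff'] = df['value'] - g.transform('first')  # offset from the group's first value
df['Count'] = g.transform('count')               # group size
df['Index'] = g.cumcount() + 1                   # 1-based position within the group
```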
I have three columns, A, B and C. I want to create a fourth column D that contains values of A or B, based on the value of C. For example:
A B C D
0 1 2 1 1
1 2 3 0 3
2 3 4 0 4
3 4 5 1 4
In the above example, column D takes the value of column A if the value of C is 1 and the value of column B if the value of C is 0. Is there an elegant way to do it in Pandas? Thank you for your help.
Use numpy.where:
In [20]: df
Out[20]:
A B C
0 1 2 1
1 2 3 0
2 3 4 0
3 4 5 1
In [21]: df['D'] = np.where(df.C, df.A, df.B)
In [22]: df
Out[22]:
A B C D
0 1 2 1 1
1 2 3 0 3
2 3 4 0 4
3 4 5 1 4
pandas
In consideration of the OP's request
Is there an elegant way to do it in Pandas?
here is my opinion of an elegant, idiomatic pure-pandas approach:
assign + pd.Series.where
df.assign(D=df.A.where(df.C, df.B))
A B C D
0 1 2 1 1
1 2 3 0 3
2 3 4 0 4
3 4 5 1 4
In response to the comment:
how would you modify the pandas answer if instead of 0, 1 in column C you had A, B?
df.assign(D=df.lookup(df.index, df.C))
A B C D
0 1 2 A 1
1 2 3 B 3
2 3 4 B 4
3 4 5 A 4
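Note that `DataFrame.lookup` was deprecated in pandas 1.2 and removed in 2.0; the replacement suggested in the pandas docs uses `factorize` plus NumPy indexing (a sketch):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [2, 3, 4, 5], 'C': ['A', 'B', 'B', 'A']})

# map each row's C label to a column position, then index row-wise
idx, cols = pd.factorize(df['C'])
df['D'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
```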