I have the following dataframe:
A B
b 10
b 5
a 25
a 5
c 6
c 2
b 20
a 10
c 4
c 3
b 15
How can I sort it as follows:
A B
b 20
b 15
b 10
b 5
a 25
a 10
a 5
c 6
c 4
c 3
c 2
Column A is sorted by the sum of its corresponding values in column B, in descending order (the sums are b: 50, a: 40, c: 15).
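For reference, a minimal sketch reconstructing the sample frame above (a default integer index is assumed, matching the outputs below):
import pandas as pd

df = pd.DataFrame({'A': ['b', 'b', 'a', 'a', 'c', 'c', 'b', 'a', 'c', 'c', 'b'],
                   'B': [10, 5, 25, 5, 6, 2, 20, 10, 4, 3, 15]})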
Create a temporary column _t holding the group sums, then sort_values on ['_t', 'B'] in descending order and drop it:
In [269]: (df.assign(_t=df['A'].map(df.groupby('A')['B'].sum()))
              .sort_values(by=['_t', 'B'], ascending=False)
              .drop(columns='_t'))
Out[269]:
A B
6 b 20
10 b 15
0 b 10
1 b 5
2 a 25
7 a 10
3 a 5
4 c 6
8 c 4
9 c 3
5 c 2
Details
In [270]: df.assign(_t=df['A'].map(df.groupby('A')['B'].sum()))
Out[270]:
A B _t
0 b 10 50
1 b 5 50
2 a 25 40
3 a 5 40
4 c 6 15
5 c 2 15
6 b 20 50
7 a 10 40
8 c 4 15
9 c 3 15
10 b 15 50
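As a side note, the same sort key can be built with transform, which broadcasts the group sums to each row directly and avoids the explicit map; a sketch:
out = (df.assign(_t=df.groupby('A')['B'].transform('sum'))
         .sort_values(by=['_t', 'B'], ascending=False)
         .drop(columns='_t'))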
Related
I have a dataframe like this:
ID Packet Type
1 1 A
2 1 B
3 2 A
4 2 C
5 2 B
6 3 A
7 3 C
8 4 C
9 4 B
10 5 B
11 6 C
12 6 B
13 6 A
14 7 A
I want to filter the dataframe so that I keep only the entries that belong to a packet of size n whose types are all different. There are exactly n types.
For this example let's use n=3 and the types A, B, C.
In the end I want this:
ID Packet Type
3 2 A
4 2 C
5 2 B
11 6 C
12 6 B
13 6 A
How do I do this with pandas?
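For reference, a minimal sketch reconstructing the example frame (assuming ID is an ordinary column over a default integer index, as the outputs below suggest):
import pandas as pd

df = pd.DataFrame({'ID': range(1, 15),
                   'Packet': [1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 6, 6, 6, 7],
                   'Type': list('ABACBACCBBCBAA')})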
One solution, using .groupby + .filter:
df = df.groupby("Packet").filter(lambda x: len(x) == x["Type"].nunique() == 3)
print(df)
Prints:
ID Packet Type
2 3 2 A
3 4 2 C
4 5 2 B
10 11 6 C
11 12 6 B
12 13 6 A
You can use transform with nunique:
out = df[df.groupby('Packet')['Type'].transform('nunique')==3]
Out[46]:
ID Packet Type
2 3 2 A
3 4 2 C
4 5 2 B
10 11 6 C
11 12 6 B
12 13 6 A
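Note that checking nunique alone assumes no packet carries duplicate types; a sketch that also enforces the packet size explicitly, parameterized over a hypothetical n:
n = 3  # hypothetical parameter: required packet size and number of types
g = df.groupby('Packet')['Type']
out = df[g.transform('nunique').eq(n) & g.transform('size').eq(n)]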
I'd loop over the groupby object, filter and concatenate:
>>> pd.concat(frame for _,frame in df.groupby("Packet") if len(frame) == 3 and frame.Type.is_unique)
ID Packet Type
2 3 2 A
3 4 2 C
4 5 2 B
10 11 6 C
11 12 6 B
12 13 6 A
I have a dataframe like the one below:
df=pd.DataFrame({'col1':[1,2,3,4,5],'col2':list('abcde')})
I want to duplicate the dataframe by the length of its contents.
Basically, I want each value in col1 paired with the entire contents of col2.
Input:
col1 col2
0 1 a
1 2 b
2 3 c
3 4 d
4 5 e
O/P:
col1 col2
0 1 a
1 1 b
2 1 c
3 1 d
4 1 e
5 2 a
6 2 b
7 2 c
8 2 d
9 2 e
10 3 a
11 3 b
12 3 c
13 3 d
14 3 e
15 4 a
16 4 b
17 4 c
18 4 d
19 4 e
20 5 a
21 5 b
22 5 c
23 5 d
24 5 e
For this I have used:
op = []
for val in df.col1.values:
    temp = df.copy()
    temp['col1'] = val
    op.append(temp)
print(pd.concat(op, ignore_index=True))
I want to get the exact same output in a better way (without the loop).
With unstack:
pd.DataFrame(index=df.col2, columns=df.col1).unstack().reset_index().drop(columns=0)
Try itertools.product to do that:
import pandas as pd
from itertools import product
df=pd.DataFrame({'col1':[1,2,3,4,5],'col2':list('abcde')})
res = pd.DataFrame((product(df['col1'],df['col2'])),columns=['col1','col2'])
print(res)
col1 col2
0 1 a
1 1 b
2 1 c
3 1 d
4 1 e
5 2 a
6 2 b
7 2 c
8 2 d
9 2 e
10 3 a
11 3 b
12 3 c
13 3 d
14 3 e
15 4 a
16 4 b
17 4 c
18 4 d
19 4 e
20 5 a
21 5 b
22 5 c
23 5 d
24 5 e
I hope this solves your problem.
Use cross join and filter necessary columns:
df = pd.merge(df.assign(a=1), df.assign(a=1), on='a')[['col1_x','col2_y']]
df = df.rename(columns = lambda x: x.split('_')[0])
print (df)
col1 col2
0 1 a
1 1 b
2 1 c
3 1 d
4 1 e
5 2 a
6 2 b
7 2 c
8 2 d
9 2 e
10 3 a
11 3 b
12 3 c
13 3 d
14 3 e
15 4 a
16 4 b
17 4 c
18 4 d
19 4 e
20 5 a
21 5 b
22 5 c
23 5 d
24 5 e
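As a side note, pandas 1.2 added how='cross', which removes the need for the dummy a column; a sketch:
# assuming pandas >= 1.2, which added how='cross'
out = pd.merge(df[['col1']], df[['col2']], how='cross')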
So, what you want is the Cartesian product. I would do it like this:
from itertools import product
pd.DataFrame(product(df.col1.values, df.col2.values), columns=["col1", "col2"])
# output
    col1 col2
0 1 a
1 1 b
2 1 c
3 1 d
4 1 e
5 2 a
6 2 b
7 2 c
8 2 d
9 2 e
10 3 a
11 3 b
12 3 c
13 3 d
14 3 e
15 4 a
16 4 b
17 4 c
18 4 d
19 4 e
20 5 a
21 5 b
22 5 c
23 5 d
24 5 e
Note that you need to pass the column names again, though (via the columns= argument).
Well... essentially anything that gives you the Cartesian product would do. For example:
pd.MultiIndex.from_product([df['col1'], df['col2']]).to_frame(index=False, name=['col1', 'col2'])
There you go:
import pandas as pd
df=pd.DataFrame({'col1':[1,2,3,4,5],'col2':list('abcde')})
# repeat each col1 value len(df) times
df_col1 = df['col1'].repeat(df.shape[0]).reset_index(drop=True)
# tile col2 (repeat the whole column len(df) times)
df_col2 = pd.concat([df['col2']] * df.shape[0], ignore_index=True)
# concat the results side by side
result = pd.concat([df_col1, df_col2], axis=1, sort=False)
Output:
col1 col2
0 1 a
1 1 b
2 1 c
3 1 d
4 1 e
5 2 a
6 2 b
7 2 c
8 2 d
9 2 e
10 3 a
11 3 b
12 3 c
13 3 d
14 3 e
15 4 a
16 4 b
17 4 c
18 4 d
19 4 e
20 5 a
21 5 b
22 5 c
23 5 d
24 5 e
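The same repeat-and-tile idea can also be written directly with NumPy, as a sketch:
import numpy as np

out = pd.DataFrame({'col1': np.repeat(df['col1'].to_numpy(), len(df)),
                    'col2': np.tile(df['col2'].to_numpy(), len(df))})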
I have to draw a random sample from the data such that:
'a' should have 6 values,
'b' should have 4 values, and
'c' should have 7 values, all chosen randomly.
Data Value
a 1
a 2
a 3
a 4
a 5
a 6
a 7
a 8
a 9
a 10
b 1
b 2
b 3
b 4
b 5
b 6
b 7
b 8
b 9
c 1
c 2
c 3
c 4
c 5
c 6
c 7
c 8
I want output as:
Data Value
a 3
a 5
a 7
a 2
a 4
a 9
b 3
b 5
b 7
b 8
c 1
c 3
c 4
c 5
c 6
c 7
c 8
First define the sample counts for each group in a dict, then use groupby with sample:
d = {'a':6, 'b':4, 'c':7}
df = df.groupby('Data', group_keys=False).apply(lambda x: x.sample(d[x.name]))
print (df)
Data Value
7 a 8
5 a 6
0 a 1
2 a 3
9 a 10
8 a 9
17 b 8
18 b 9
15 b 6
14 b 5
22 c 4
23 c 5
25 c 7
21 c 3
20 c 2
24 c 6
19 c 1
Another approach: filter only the rows matching each key of the dict, sample them, and concatenate:
d = {'a':6, 'b':4, 'c':7}
df = pd.concat([df[df['Data'].eq(k)].sample(v) for k, v in d.items()], ignore_index=True)
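Either way, a quick sanity check that the sampled group sizes match the request; note that sample accepts random_state if you need a reproducible draw:
print(df.groupby('Data').size())
# Data
# a    6
# b    4
# c    7
# dtype: int64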
Consider the following dataset:
df = pd.DataFrame(data=np.array([['a', 1, 2, 3, 'T'],
                                 ['b', 4, 5, 6, 'T'],
                                 ['b', 9, 9, 39, 'T'],
                                 ['c', 16, 17, 18, 'N']]),
                  columns=['id', 'A', 'B', 'C', 'Active'])
id A B C Active
a 1 2 3 T
b 4 5 6 T
b 9 9 39 T
c 16 17 18 N
I need to augment each row of each group (id) with the rows where Active == 'T', which means:
a 1 2 3 a 1 2 3
b 4 5 6 a 1 2 3
b 9 9 39 a 1 2 3
a 1 2 3 b 4 5 6
b 4 5 6 b 4 5 6
b 9 9 39 b 4 5 6
a 1 2 3 b 9 9 39
b 4 5 6 b 9 9 39
b 9 9 39 b 9 9 39
a 1 2 3 c 16 17 18
b 9 9 39 c 16 17 18
b 4 5 6 c 16 17 18
I have an idea which I could not implement.
First, make a new dataset by filtering the data:
take all rows where the Active column equals 'T' and save them in a new df.
df_t = df[df['Active']=='T']
Then, for each row of df, append a vector from the df_t dataset,
which means (in pseudocode):
for sample in df:
    for t in df_t:
        df_new = sample + t   # join the vectors of df and df_t
        Df_new = concat(df_new, Df_new)
I really appreciate your comments and suggestions on implementing my idea!
You want the Cartesian cross product of df and df_t. You can do it with a bit of a hack like this:
df['cross'] = 1
df_t['cross'] = 1
df_new = pd.merge(df,df_t.drop('Active',axis=1),on='cross').drop('cross',axis=1)
Putting it all together:
import numpy as np
import pandas as pd
df = pd.DataFrame(data=np.array([['a', 1, 2, 3, 'T'],
                                 ['b', 4, 5, 6, 'T'],
                                 ['b', 9, 9, 39, 'T'],
                                 ['c', 16, 17, 18, 'N']]),
                  columns=['id', 'A', 'B', 'C', 'Active'])
df_t = df[df['Active']=='T'].copy()  # .copy() avoids SettingWithCopyWarning when adding 'cross' below
df['cross'] = 1
df_t['cross'] = 1
df_new = pd.merge(df,df_t.drop('Active',axis=1),on='cross').drop('cross',axis=1)
results in:
>>> df_new
id_x A_x B_x C_x Active id_y A_y B_y C_y
0 a 1 2 3 T a 1 2 3
1 a 1 2 3 T b 4 5 6
2 a 1 2 3 T b 9 9 39
3 b 4 5 6 T a 1 2 3
4 b 4 5 6 T b 4 5 6
5 b 4 5 6 T b 9 9 39
6 b 9 9 39 T a 1 2 3
7 b 9 9 39 T b 4 5 6
8 b 9 9 39 T b 9 9 39
9 c 16 17 18 N a 1 2 3
10 c 16 17 18 N b 4 5 6
11 c 16 17 18 N b 9 9 39
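As a side note, on pandas 1.2+ the helper column is unnecessary, since merge supports how='cross' directly; a sketch:
# assuming pandas >= 1.2, which added how='cross';
# overlapping columns still get the default _x/_y suffixes
df_new = df.merge(df_t.drop(columns='Active'), how='cross')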
Consider the dataframe df:
df = pd.DataFrame(dict(
A=list('aaaaabbbbccc'),
B=range(12)
))
print(df)
A B
0 a 0
1 a 1
2 a 2
3 a 3
4 a 4
5 b 5
6 b 6
7 b 7
8 b 8
9 c 9
10 c 10
11 c 11
I want to sort the dataframe such that, if I grouped by column 'A', I'd pull the first position from each group, then cycle back and get the second position from each group if any remain, and so on.
I'd expect results to look like this:
A B
0 a 0
5 b 5
9 c 9
1 a 1
6 b 6
10 c 10
2 a 2
7 b 7
11 c 11
3 a 3
8 b 8
4 a 4
You can use cumcount to number the rows within each group first, then sort_values and reindex by the Series cum:
cum = df.groupby('A')['B'].cumcount().sort_values()
print (cum)
0 0
5 0
9 0
1 1
6 1
10 1
2 2
7 2
11 2
3 3
8 3
4 4
dtype: int64
print (df.reindex(cum.index))
A B
0 a 0
5 b 5
9 c 9
1 a 1
6 b 6
10 c 10
2 a 2
7 b 7
11 c 11
3 a 3
8 b 8
4 a 4
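The two steps also collapse into a single expression; a sketch that forces a stable mergesort so ties keep their original row order:
# equivalent one-liner; mergesort is stable, so rows with equal
# within-group counts keep their original relative order
out = df.loc[df.groupby('A').cumcount().sort_values(kind='mergesort').index]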
Here's a NumPy approach -
import numpy as np

def approach1(g, v):
    # Inputs : 1D arrays of the groupby and value columns.
    # Assumes the group column is contiguous (pre-grouped) with 2+ groups.
    # Each row contributes +1 to a running within-group position...
    id_arr2 = np.ones(v.size, dtype=int)
    # ...and at every group boundary we subtract the previous group's
    # length so the cumulative sum resets to 1
    sf = np.flatnonzero(g[1:] != g[:-1]) + 1
    id_arr2[sf[0]] = -sf[0] + 1
    id_arr2[sf[1:]] = sf[:-1] - sf[1:] + 1
    # cumsum gives within-group positions; a stable argsort interleaves them
    return id_arr2.cumsum().argsort(kind='mergesort')
Sample run -
In [246]: df
Out[246]:
A B
0 a 0
1 a 1
2 a 2
3 a 3
4 a 4
5 b 5
6 b 6
7 b 7
8 b 8
9 c 9
10 c 10
11 c 11
In [247]: df.iloc[approach1(df.A.values, df.B.values)]
Out[247]:
A B
0 a 0
5 b 5
9 c 9
1 a 1
6 b 6
10 c 10
2 a 2
7 b 7
11 c 11
3 a 3
8 b 8
4 a 4
Or, using df.reindex from @jezrael's post (this works here because the index is the default RangeIndex, so labels coincide with positions):
df.reindex(approach1(df.A.values, df.B.values))