How to multiply a dataframe in an efficient way - python

I have a dataframe like below:
df=pd.DataFrame({'col1':[1,2,3,4,5],'col2':list('abcde')})
I want to duplicate the dataframe by the length of its contents: basically, each value in col1 should be paired with the entire contents of col2.
Input:
col1 col2
0 1 a
1 2 b
2 3 c
3 4 d
4 5 e
Output:
col1 col2
0 1 a
1 1 b
2 1 c
3 1 d
4 1 e
5 2 a
6 2 b
7 2 c
8 2 d
9 2 e
10 3 a
11 3 b
12 3 c
13 3 d
14 3 e
15 4 a
16 4 b
17 4 c
18 4 d
19 4 e
20 5 a
21 5 b
22 5 c
23 5 d
24 5 e
For this I have used:
op = []
for val in df.col1.values:
    temp = df.copy()
    temp['col1'] = val
    op.append(temp)
print(pd.concat(op, ignore_index=True))
I want to get the exact same output in a better way (without the loop).

With unstack:
pd.DataFrame(index=df.col2, columns=df.col1).unstack().reset_index().drop(columns=0)

Try itertools to do that:
import pandas as pd
from itertools import product
df=pd.DataFrame({'col1':[1,2,3,4,5],'col2':list('abcde')})
res = pd.DataFrame((product(df['col1'],df['col2'])),columns=['col1','col2'])
print(res)
col1 col2
0 1 a
1 1 b
2 1 c
3 1 d
4 1 e
5 2 a
6 2 b
7 2 c
8 2 d
9 2 e
10 3 a
11 3 b
12 3 c
13 3 d
14 3 e
15 4 a
16 4 b
17 4 c
18 4 d
19 4 e
20 5 a
21 5 b
22 5 c
23 5 d
24 5 e
I hope this solves your problem.

Use a cross join and filter the necessary columns:
df = pd.merge(df.assign(a=1), df.assign(a=1), on='a')[['col1_x','col2_y']]
df = df.rename(columns = lambda x: x.split('_')[0])
print (df)
col1 col2
0 1 a
1 1 b
2 1 c
3 1 d
4 1 e
5 2 a
6 2 b
7 2 c
8 2 d
9 2 e
10 3 a
11 3 b
12 3 c
13 3 d
14 3 e
15 4 a
16 4 b
17 4 c
18 4 d
19 4 e
20 5 a
21 5 b
22 5 c
23 5 d
24 5 e
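On pandas 1.2 or newer, the dummy-key trick above can be skipped entirely, since merge accepts how='cross' (a sketch assuming a recent pandas):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': list('abcde')})

# Cartesian product of the two columns, no helper key or renaming needed
res = pd.merge(df[['col1']], df[['col2']], how='cross')
print(res)
```

This also avoids the `_x`/`_y` suffix cleanup, because the two sides carry different column names to begin with.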

So, what you want is the Cartesian product. I would do it like this:
from itertools import product
pd.DataFrame(product(df.col1.values, df.col2.values), columns=["col1", "col2"])
#output
col1 col2
0 1 a
1 1 b
2 1 c
3 1 d
4 1 e
5 2 a
6 2 b
7 2 c
8 2 d
9 2 e
10 3 a
11 3 b
12 3 c
13 3 d
14 3 e
15 4 a
16 4 b
17 4 c
18 4 d
19 4 e
20 5 a
21 5 b
22 5 c
23 5 d
24 5 e
You need to pass in the column names again, though.

Well, essentially anything that gives you the Cartesian product would do. For example:
pd.MultiIndex.from_product([df['col1'],df['col2']]).to_frame(index=False, name=['Col1','Col2'])

There you go:
import pandas as pd
df=pd.DataFrame({'col1':[1,2,3,4,5],'col2':list('abcde')})
# repeat each col1 value len(df) times
df_col1 = df['col1'].repeat(df.shape[0]).reset_index(drop=True)
# tile col2 by stacking the whole column len(df) times
df_col2 = pd.concat([df['col2']] * df.shape[0], ignore_index=True)
# concatenate the results side by side
result = pd.concat([df_col1, df_col2], axis=1, sort=False)
Output:
col1 col2
0 1 a
1 1 b
2 1 c
3 1 d
4 1 e
5 2 a
6 2 b
7 2 c
8 2 d
9 2 e
10 3 a
11 3 b
12 3 c
13 3 d
14 3 e
15 4 a
16 4 b
17 4 c
18 4 d
19 4 e
20 5 a
21 5 b
22 5 c
23 5 d
24 5 e
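The repeat/tile idea above can also be written directly with NumPy, which avoids the intermediate concat calls (a sketch):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': list('abcde')})
n = len(df)

# each col1 value is repeated n times; col2 is tiled n times
result = pd.DataFrame({
    'col1': np.repeat(df['col1'].to_numpy(), n),
    'col2': np.tile(df['col2'].to_numpy(), n),
})
print(result)
```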

Related

Python: How to add groupby but not affect ngroup()?

Per user I want a unique item order (as they click through items). If an item has already been seen, don't increase the cumulative count; instead place the already-assigned value there, e.g. c, d, g and b in the tables below. I used the function below, but it's not getting the job done at the moment: if I add 'user_id' to the grouper, I mess up the ngroup(). Can anyone help me with this?
df['Order Number'] = df.groupby(pd.Grouper(key='Item',sort=False)).ngroup()+1
print(df)
Current Output:
User_id Item Order Number
0 1 b 1
1 1 a 2
2 1 c 3
3 1 d 4
4 1 c 3
5 1 d 4
6 1 e 5
7 1 b 1
8 1 f 6
9 1 g 7
10 1 b 1
-----------------------------
11 2 x 8
12 2 g 7
13 2 g 7
14 2 f 6
15 2 h 9
16 2 i 10
17 2 f 6
18 2 k 11
19 2 l 12
Desired Output:
User_id Item Order Number
0 1 b 1
1 1 a 2
2 1 c 3
3 1 d 4
4 1 c 3
5 1 d 4
6 1 e 5
7 1 b 1
8 1 f 6
9 1 g 7
10 1 b 1
-----------------------------
11 2 x 1
12 2 g 2
13 2 g 2
14 2 f 3
15 2 h 4
16 2 i 5
17 2 f 3
18 2 k 6
19 2 l 7
Use GroupBy.transform with factorize in lambda function:
df['Order Number'] = df.groupby('User_id')['Item'].transform(lambda x: pd.factorize(x)[0])+1
print (df)
User_id Item Order Number
0 1 b 1
1 1 a 2
2 1 c 3
3 1 d 4
4 1 c 3
5 1 d 4
6 1 e 5
7 1 b 1
8 1 f 6
9 1 g 7
10 1 b 1
11 2 x 1
12 2 g 2
13 2 g 2
14 2 f 3
15 2 h 4
16 2 i 5
17 2 f 3
18 2 k 6
19 2 l 7
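pd.factorize numbers values by order of first appearance, which is exactly the "first seen" ordering the question asks for; a minimal illustration:

```python
import pandas as pd

s = pd.Series(['x', 'g', 'g', 'f', 'h', 'f'])

# codes number each value by order of first appearance, starting at 0;
# repeats get the code assigned on their first occurrence
codes, uniques = pd.factorize(s)
print((codes + 1).tolist())   # [1, 2, 2, 3, 4, 3]
print(list(uniques))          # ['x', 'g', 'f', 'h']
```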

How to replace values of a pandas column with the most frequent value [duplicate]

This question already has answers here:
GroupBy pandas DataFrame and select most common value
(13 answers)
Closed 3 years ago.
I've already tried different answers but none of them solved my problem. I also looked at this answer, but it didn't work either.
Here is my dataframe:
import numpy as np
import pandas as pd
np.random.seed(2)
col1 = np.random.choice([1,2,3], size=(50))
col2 = np.random.choice([1,2,3,4], size=(50))
col3 = np.random.choice(['a', 'b', 'c', 'd', 'e'], size=(50))
data = {'col1':col1, 'col2':col2, 'col3':col3}
df = pd.DataFrame(data)
I want to
1) group by the col1 and col2 columns, and
2) create a new column col4 that holds the most frequent value of col3 in each group.
The final df should look like this:
col1 col2 col3 col4
0 1 1 b b
1 1 1 b b
2 1 2 a b
3 1 2 b b
4 1 2 b b
5 1 2 b b
6 1 2 c b
7 1 3 a a
8 1 3 c a
9 1 3 b a
10 1 3 c a
11 1 3 a a
12 1 3 b a
13 1 3 a a
14 1 3 a a
15 1 3 c a
16 1 4 a a
17 2 1 c c
18 2 1 c c
19 2 1 a c
20 2 1 c c
21 2 1 c c
22 2 1 b c
23 2 2 a a
24 2 2 c a
25 2 2 a a
26 2 3 a a
27 2 3 a a
28 2 4 c c
29 2 4 c c
30 3 1 b a
31 3 1 a a
32 3 1 a a
33 3 1 c a
34 3 1 b a
35 3 2 c c
36 3 2 c c
37 3 2 b c
38 3 2 a c
39 3 2 c c
40 3 3 b b
41 3 3 a b
42 3 3 b b
43 3 3 c b
44 3 3 a b
45 3 3 b b
46 3 3 b b
47 3 3 c b
48 3 4 b b
49 3 4 c c
For example, I used this code without any success:
df1 = df.groupby(['col1', 'col2'])['col3'].agg(lambda x: x.value_counts().index[0])
The reason .transform(pd.Series.mode) didn't work is that it returns multiple values when there is a tie for the mode. We can solve this by taking the first of them:
df['col4'] = df.groupby(['col1', 'col2'])['col3'].transform(lambda x: x.mode()[0])
Or
df['col4'] = df.groupby(['col1', 'col2'])['col3'].transform(lambda x: pd.Series.mode(x)[0])
col1 col2 col3 col4
0 1 1 b b
1 1 1 b b
2 1 2 a b
3 1 2 b b
4 1 2 b b
5 1 2 b b
6 1 2 c b
7 1 3 a a
8 1 3 c a
9 1 3 b a
10 1 3 c a
11 1 3 a a
12 1 3 b a
13 1 3 a a
14 1 3 a a
15 1 3 c a
16 1 4 a a
17 2 1 c c
18 2 1 c c
19 2 1 a c
20 2 1 c c
21 2 1 c c
22 2 1 b c
23 2 2 a a
24 2 2 c a
25 2 2 a a
26 2 3 a a
27 2 3 a a
28 2 4 c c
29 2 4 c c
30 3 1 b a
31 3 1 a a
32 3 1 a a
33 3 1 c a
34 3 1 b a
35 3 2 c c
36 3 2 c c
37 3 2 b c
38 3 2 a c
39 3 2 c c
40 3 3 b b
41 3 3 a b
42 3 3 b b
43 3 3 c b
44 3 3 a b
45 3 3 b b
46 3 3 b b
47 3 3 c b
48 3 4 b b
49 3 4 c b
You want idxmax:
df['col4'] = df.groupby(['col1', 'col2']).col3.transform(lambda x: x.value_counts().idxmax())
Sample data:
np.random.seed(2)
col1 = np.random.choice([1,2,3], size=(10))
col2 = np.random.choice([1,2,3,4], size=(10))
col3 = np.random.choice(['a', 'b', 'c', 'd', 'e'], size=(10))
data = {'col1':col1, 'col2':col2, 'col3':col3}
df = pd.DataFrame(data)
gives:
col1 col2 col3 col4
0 1 1 d b
1 2 1 c c
2 1 1 b b
3 3 2 c c
4 3 4 e b
5 1 4 d d
6 3 3 a a
7 2 1 e c
8 2 3 d d
9 3 4 b b
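Note that the two answers can disagree when a group has a tie: Series.mode returns the tied modes sorted, so mode()[0] picks the smallest, while value_counts().idxmax() keeps whichever tied value value_counts happens to list first. A small sketch of the difference:

```python
import pandas as pd

s = pd.Series(['c', 'b', 'c', 'b'])   # 'b' and 'c' tie at 2 apiece

print(s.mode().tolist())              # ['b', 'c'] -- modes come back sorted
print(s.value_counts().idxmax())      # whichever tied value is listed first
```

If deterministic tie-breaking matters, mode()[0] is the predictable choice.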
You could also find the mode in each group and merge it back onto the original frame. Since pd.Series.mode can return several values per group, aggregate down to a single value before merging:
modes = (df.groupby(['col1', 'col2'])['col3']
           .agg(lambda x: x.mode()[0])
           .rename('col4')
           .reset_index())
df = df.merge(modes, on=['col1', 'col2'], how='left')

How do I filter out random sample from dataframe in which there are different sample size for each value, in python?

I have to filter out a random sample from Data in which:
'a' should have 6 values,
'b' should have 4 values, and
'c' should have 7 values, chosen randomly.
Data Value
a 1
a 2
a 3
a 4
a 5
a 6
a 7
a 8
a 9
a 10
b 1
b 2
b 3
b 4
b 5
b 6
b 7
b 8
b 9
c 1
c 2
c 3
c 4
c 5
c 6
c 7
c 8
I want output as:
Data Value
a 3
a 5
a 7
a 2
a 4
a 9
b 3
b 5
b 7
b 8
c 1
c 3
c 4
c 5
c 6
c 7
c 8
First define the counts of samples for each group, then use groupby with sample:
d = {'a':6, 'b':4, 'c':7}
df = df.groupby('Data', group_keys=False).apply(lambda x: x.sample(d[x.name]))
print (df)
Data Value
7 a 8
5 a 6
0 a 1
2 a 3
9 a 10
8 a 9
17 b 8
18 b 9
15 b 6
14 b 5
22 c 4
23 c 5
25 c 7
21 c 3
20 c 2
24 c 6
19 c 1
Another approach, filtering only the values matching the keys of the dict:
d = {'a':6, 'b':4, 'c':7}
df = pd.concat([df[df['Data'].eq(k)].sample(v) for k, v in d.items()], ignore_index=True)
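Both approaches draw a fresh random sample on every run; passing random_state pins the draw so results are reproducible. A sketch reusing the same counts dict (the df here is rebuilt from the question's data):

```python
import pandas as pd

df = pd.DataFrame({
    'Data': ['a'] * 10 + ['b'] * 9 + ['c'] * 8,
    'Value': list(range(1, 11)) + list(range(1, 10)) + list(range(1, 9)),
})
d = {'a': 6, 'b': 4, 'c': 7}

# random_state pins the sample so reruns return the same rows
out = (df.groupby('Data', group_keys=False)
         .apply(lambda x: x.sample(d[x.name], random_state=42)))
print(out)
```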

Pandas sorting multiple columns

I have the following dataframe
A B
b 10
b 5
a 25
a 5
c 6
c 2
b 20
a 10
c 4
c 3
b 15
How can I sort it as follows:
A B
b 20
b 15
b 10
b 5
a 25
a 10
a 5
c 6
c 4
c 3
c 2
Column A is sorted based on the sum of the corresponding values in column B, in descending order (the sums are b: 50, a: 40, c: 15).
Create a temporary column _t and sort using sort_values on ['_t', 'B']:
In [269]: (df.assign(_t=df['A'].map(df.groupby('A')['B'].sum()))
           .sort_values(by=['_t', 'B'], ascending=False)
           .drop(columns='_t'))
Out[269]:
A B
6 b 20
10 b 15
0 b 10
1 b 5
2 a 25
7 a 10
3 a 5
4 c 6
8 c 4
9 c 3
5 c 2
Details
In [270]: df.assign(_t=df['A'].map(df.groupby('A')['B'].sum()))
Out[270]:
A B _t
0 b 10 50
1 b 5 50
2 a 25 40
3 a 5 40
4 c 6 15
5 c 2 15
6 b 20 50
7 a 10 40
8 c 4 15
9 c 3 15
10 b 15 50
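On pandas 1.1+, the temporary column can be folded away using the key parameter of sort_values, which is applied to each sort column in turn; mapping column A to its group total inside the key gives the same ordering (a sketch, relying on the key callback seeing the column name via s.name):

```python
import pandas as pd

df = pd.DataFrame({'A': list('bbaaccbaccb'),
                   'B': [10, 5, 25, 5, 6, 2, 20, 10, 4, 3, 15]})

totals = df.groupby('A')['B'].sum()   # b: 50, a: 40, c: 15

# sort column A by its mapped group total, column B by its own values
out = df.sort_values(['A', 'B'], ascending=False,
                     key=lambda s: s.map(totals) if s.name == 'A' else s)
print(out)
```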

sort dataframe by position in group then by that group

consider the dataframe df
df = pd.DataFrame(dict(
A=list('aaaaabbbbccc'),
B=range(12)
))
print(df)
A B
0 a 0
1 a 1
2 a 2
3 a 3
4 a 4
5 b 5
6 b 6
7 b 7
8 b 8
9 c 9
10 c 10
11 c 11
I want to sort the dataframe such that, if I grouped by column 'A', I'd pull the first position from each group, then cycle back and get the second position from each group if any remain, and so on.
I'd expect the results to look like this:
A B
0 a 0
5 b 5
9 c 9
1 a 1
6 b 6
10 c 10
2 a 2
7 b 7
11 c 11
3 a 3
8 b 8
4 a 4
You can use cumcount to number the rows within each group first, then sort_values on that count and reindex by the cum Series' index:
cum = df.groupby('A')['B'].cumcount().sort_values()
print (cum)
0 0
5 0
9 0
1 1
6 1
10 1
2 2
7 2
11 2
3 3
8 3
4 4
dtype: int64
print (df.reindex(cum.index))
A B
0 a 0
5 b 5
9 c 9
1 a 1
6 b 6
10 c 10
2 a 2
7 b 7
11 c 11
3 a 3
8 b 8
4 a 4
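The cumcount idea can also be compressed into a single positional sort: a stable argsort of the within-group counter interleaves the groups directly (a sketch):

```python
import pandas as pd

df = pd.DataFrame(dict(A=list('aaaaabbbbccc'), B=range(12)))

# cumcount ranks each row within its group; a stable argsort of those
# ranks pulls all the 0th members first, then the 1st, and so on
order = df.groupby('A').cumcount().argsort(kind='mergesort')
out = df.iloc[order]
print(out)
```

mergesort matters here: a non-stable sort could shuffle rows that share the same rank.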
Here's a NumPy approach -
def approach1(g, v):
    # Inputs : 1D arrays of the groupby and value columns
    id_arr2 = np.ones(v.size, dtype=int)
    sf = np.flatnonzero(g[1:] != g[:-1]) + 1
    id_arr2[sf[0]] = -sf[0] + 1
    id_arr2[sf[1:]] = sf[:-1] - sf[1:] + 1
    return id_arr2.cumsum().argsort(kind='mergesort')
Sample run -
In [246]: df
Out[246]:
A B
0 a 0
1 a 1
2 a 2
3 a 3
4 a 4
5 b 5
6 b 6
7 b 7
8 b 8
9 c 9
10 c 10
11 c 11
In [247]: df.iloc[approach1(df.A.values, df.B.values)]
Out[247]:
A B
0 a 0
5 b 5
9 c 9
1 a 1
6 b 6
10 c 10
2 a 2
7 b 7
11 c 11
3 a 3
8 b 8
4 a 4
Or, using df.reindex from @jezrael's post (this works here because the index is the default RangeIndex, so labels and positions coincide):
df.reindex(approach1(df.A.values, df.B.values))
