How to do R dplyr's do in pandas?

I would like to split my pandas DataFrame into groups and then run a complex function on each chunk. For each chunk, the complex function returns a DataFrame with an arbitrary number of rows and an arbitrary number and naming of columns. I would like those result DataFrames to be combined after the operation. In R I am able to do this with
library(tibble)
library(dplyr)
df = tribble(
  ~g, ~c1, ~c2,
  "a", 1, 6,
  "a", 2, 7,
  "b", 3, 8,
  "b", 4, 9,
  "b", 5, 10
)
myfct <- function(x, y){
  data.frame(c1 = x,
             c2 = y,
             res = c(x * y, x + y, x / y),
             type = c('mult', 'add', 'div'))
}
df %>% group_by(g) %>% do(myfct(.$c1, .$c2))
with the result being
Source: local data frame [15 x 5]
Groups: g [2]
g c1 c2 res type
<chr> <dbl> <dbl> <dbl> <fctr>
1 a 1 6 6.0000000 mult
2 a 2 7 14.0000000 add
3 a 1 6 7.0000000 div
4 a 2 7 9.0000000 mult
5 a 1 6 0.1666667 add
6 a 2 7 0.2857143 div
7 b 3 8 24.0000000 mult
8 b 4 9 36.0000000 add
9 b 5 10 50.0000000 div
10 b 3 8 11.0000000 mult
11 b 4 9 13.0000000 add
12 b 5 10 15.0000000 div
13 b 3 8 0.3750000 mult
14 b 4 9 0.4444444 add
15 b 5 10 0.5000000 div
This is, of course, only an example.

I think you need apply (see also flexible apply):
def myfct(x):
    print (x)  # inspect each group's chunk
    return pd.DataFrame({'mult': x['c1'] * x['c2'],
                         'add':  x['c1'] + x['c2'],
                         'div':  x['c1'] / x['c2'],
                         'g':    x.name,
                         'c1':   x['c1'],
                         'c2':   x['c2']})

df = df.groupby('g')[['c1','c2']].apply(myfct)
print (df)
add c1 c2 div g mult
0 7 1 6 0.166667 a 6
1 9 2 7 0.285714 a 14
2 11 3 8 0.375000 b 24
3 13 4 9 0.444444 b 36
4 15 5 10 0.500000 b 50
For reshaping it is also possible to use melt:
df = (df.groupby('g')[['c1','c2']].apply(myfct)
        .melt(id_vars=['g','c1','c2'], value_name='res', var_name='type'))
print (df)
g c1 c2 type res
0 a 1 6 add 7.000000
1 a 2 7 add 9.000000
2 b 3 8 add 11.000000
3 b 4 9 add 13.000000
4 b 5 10 add 15.000000
5 a 1 6 div 0.166667
6 a 2 7 div 0.285714
7 b 3 8 div 0.375000
8 b 4 9 div 0.444444
9 b 5 10 div 0.500000
10 a 1 6 mult 6.000000
11 a 2 7 mult 14.000000
12 b 3 8 mult 24.000000
13 b 4 9 mult 36.000000
14 b 5 10 mult 50.000000
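If you want something closer to dplyr's do, i.e. a per-group function that may return a DataFrame of arbitrary shape which is then stacked back together, here is a rough sketch (not from the answer above; it starts again from the original df, and it labels each row with the operation actually applied, unlike the recycled R output):

def do_like(chunk):
    # chunk is the sub-DataFrame for one group; any shape of DataFrame may be returned
    pieces = [pd.DataFrame({'c1': chunk['c1'], 'c2': chunk['c2'],
                            'res': op(chunk['c1'], chunk['c2']), 'type': name})
              for name, op in [('mult', lambda a, b: a * b),
                               ('add',  lambda a, b: a + b),
                               ('div',  lambda a, b: a / b)]]
    return pd.concat(pieces, ignore_index=True)

out = (df.groupby('g')
         .apply(do_like)
         .reset_index(level=0)   # bring the group key back as a column
         .reset_index(drop=True))

Because the returned frames have their own fresh index, groupby.apply prefixes the group key to the result's index, and reset_index turns that back into the g column.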

Related

Python Dataframe - Get max value between specific number vs. column value

Given the df below, I want to get a column 'C' that holds the maximum of the specific value 15 and column 'A' where the condition "B == 't'" holds, and just the value of 'A' otherwise.
testdf = pd.DataFrame({"A":[20, 16, 7, 3, 8],"B":['t','t','t','t','f']})
testdf
A B
0 20 t
1 16 t
2 7 t
3 3 t
4 8 f
I tried this:
testdf.loc[testdf['B']=='t', 'C'] = max(15,(testdf.loc[testdf['B']=='t','A']))
And desired output is:
A B C
0 20 t 20
1 16 t 16
2 7 t 15
3 3 t 15
4 8 f 8
Could you help me to get the output? Thank you!
Use np.where with clip:
testdf['C'] = np.where(testdf['B'].eq('t'),
                       testdf['A'].clip(15), testdf['A'])
Or similarly with series.where:
testdf['C'] = (testdf['A'].clip(15)
                          .where(testdf['B'].eq('t'), testdf['A'])
              )
output:
A B C
0 20 t 20
1 16 t 16
2 7 t 15
3 3 t 15
4 8 f 8
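Note that clip(15) here is shorthand for clip(lower=15), i.e. every value below 15 is raised to 15, which is exactly the per-row "max(15, A)". The same line with the keyword spelled out (purely a readability variant) would be:

testdf['C'] = np.where(testdf['B'].eq('t'), testdf['A'].clip(lower=15), testdf['A'])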
You could also use the update method:
testdf['C'] = testdf['A']
A B C
0 20 t 20
1 16 t 16
2 7 t 7
3 3 t 3
4 8 f 8
values = testdf.A[testdf.B.eq('t')].clip(15)
values
Out[16]:
0 20
1 16
2 15
3 15
Name: A, dtype: int64
testdf.update(values.rename('C'))
A B C
0 20 t 20.0
1 16 t 16.0
2 7 t 15.0
3 3 t 15.0
4 8 f 8.0
To apply any formula to individual values in a DataFrame column you can use
df['column'] = df['column'].apply(lambda x: anyFunc(x))
Here x receives the values of the column one by one and passes each to the function, where you can manipulate it and return the result.
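For example, a minimal self-contained sketch of that pattern (the toy column and anyFunc are just placeholders for your own data and logic):

import pandas as pd

df = pd.DataFrame({'column': [1, 2, 3]})

def anyFunc(x):
    # placeholder for arbitrary per-value logic
    return x * 10

# each value of 'column' is passed to anyFunc; the returned values become the new column
df['column'] = df['column'].apply(lambda x: anyFunc(x))
print(df)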

In pandas groupby mode use user defined function, apply it to multiple columns and assign the results to new pandas columns

I have the following data set:
> dt
a b group
1: 1 5 a
2: 2 6 a
3: 3 7 b
4: 4 8 b
I have the following function:
def bigSum(a, b):
    return a.min() + b.max()
I want to apply this function to the a and b columns in groupby mode (by group) and assign the result to a new column c of the data frame. My desired result is
> dt
a b group c
1: 1 5 a 7
2: 2 6 a 7
3: 3 7 b 11
4: 4 8 b 11
For instance, if I were using R data.table, I would do the following:
dt[, c := bigSum(a,b), by = group]
and it would work exactly as I expect. I am interested in whether there is something similar in pandas.
In pandas we have transform
g = df.groupby('group')
df['out'] = g.a.transform('min') + g.b.transform('max')
df
Out[282]:
a b group out
1 1 5 a 7
2 2 6 a 7
3 3 7 b 11
4 4 8 b 11
Update
df['new'] = (df.groupby('group')
               .apply(lambda x: bigSum(x['a'], x['b']))
               .reindex(df.group)
               .values)
df
Out[287]:
a b group out new
1 1 5 a 7 7
2 2 6 a 7 7
3 3 7 b 11 11
4 4 8 b 11 11
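A slightly shorter variant of the same idea (just a sketch on the frame above, not part of the answer) maps the per-group result back through the group column instead of reindexing:

per_group = df.groupby('group').apply(lambda g: bigSum(g['a'], g['b']))
df['c'] = df['group'].map(per_group)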

How to replace values in selected rows columns with an array in dataframe?

I have a df
df = pd.DataFrame(data={'A': [1,2,3,4,5,6,7,8],
'B': [10,20,30,40,50,60,70,80]})
A B
0 1 10
1 2 20
2 3 30
3 4 40
4 5 50
5 6 60
6 7 70
7 8 80
which I selected a few rows from.
Then I have a dictionary containing values that I should insert into the B column
if the key matches the value in the A column of df
my_dict = {2: 39622884,
4: 82709546,
5: 28166511,
7: 89465652}
When I use the following assignment
df.loc[df['A'].isin(my_dict.keys())]['B'] = list(my_dict.values())
I get the error:
ValueError: Length of values does not match length of index
The desirable output is
A B
0 1 10
1 2 39622884
2 3 30
3 4 82709546
4 5 50
5 6 28166511
6 7 89465652
7 8 80
What is the correct way to implement this procedure?
You can make do with map and fillna:
df['B'] = df['A'].map(my_dict).fillna(df['B'])
Output:
A B
0 1 10.0
1 2 39622884.0
2 3 30.0
3 4 82709546.0
4 5 28166511.0
5 6 60.0
6 7 89465652.0
7 8 80.0
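The column comes back as float because map produces NaN for keys that are missing before fillna patches them. If you want to keep integers (a small sketch, assuming every value fits into int64), you can cast afterwards:

df['B'] = df['A'].map(my_dict).fillna(df['B']).astype('int64')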

Rolling sum in subgroups of a dataframe (pandas)

I have a sessions dataframe that contains E-mail and Sessions (int) columns.
I need to calculate a rolling sum of sessions per email (i.e. not globally).
Now, the following works, but it's painfully slow:
emails = set(list(sessions['E-mail']))
ses_sums = []
for em in emails:
    email_sessions = sessions[sessions['E-mail'] == em]
    email_sessions.is_copy = False
    email_sessions['Session_Rolling_Sum'] = pd.rolling_sum(email_sessions['Sessions'], window=self.window).fillna(0)
    ses_sums.append(email_sessions)
df = pd.concat(ses_sums, ignore_index=True)
Is there a way of achieving the same in pandas, but using pandas operators on a dataframe instead of creating separate dataframes for each email and then concatenating them?
(either that or some other way of making this faster)
Setup
np.random.seed([3,1415])
df = pd.DataFrame({'E-Mail': np.random.choice(list('AB'), 20),
'Session': np.random.randint(1, 10, 20)})
Solution
The current and proper way to do this is with rolling.sum, which can be used on the result of a pd.Series groupby object.
#   Series GroupBy object
# /--------------------------\
df.groupby('E-Mail').Session.rolling(3).sum()
#                            \---------------/
#                              Method you want
E-Mail
A 0 NaN
2 NaN
4 11.0
5 7.0
7 10.0
12 16.0
15 16.0
17 16.0
18 17.0
19 18.0
B 1 NaN
3 NaN
6 18.0
8 14.0
9 16.0
10 12.0
11 13.0
13 16.0
14 20.0
16 22.0
Name: Session, dtype: float64
Details
df
E-Mail Session
0 A 9
1 B 7
2 A 1
3 B 3
4 A 1
5 A 5
6 B 8
7 A 4
8 B 3
9 B 5
10 B 4
11 B 4
12 A 7
13 B 8
14 B 8
15 A 5
16 B 6
17 A 4
18 A 8
19 A 6
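If you want that rolling sum back as a column of the original frame (as in the question's Session_Rolling_Sum), one sketch, assuming the setup above, is to drop the group level of the resulting index so it aligns with df again:

roll = df.groupby('E-Mail').Session.rolling(3).sum()
df['Session_Rolling_Sum'] = roll.reset_index(level=0, drop=True).fillna(0)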
Say you start with
In [58]: df = pd.DataFrame({'E-Mail': ['foo'] * 3 + ['bar'] * 3 + ['foo'] * 3, 'Session': range(9)})
In [59]: df
Out[59]:
E-Mail Session
0 foo 0
1 foo 1
2 foo 2
3 bar 3
4 bar 4
5 bar 5
6 foo 6
7 foo 7
8 foo 8
In [60]: df[['Session']].groupby(df['E-Mail']).apply(pd.rolling_sum, 3)
Out[60]:
Session
E-Mail
bar 3 NaN
4 NaN
5 12.0
foo 0 NaN
1 NaN
2 3.0
6 9.0
7 15.0
8 21.0
Incidentally, note that I just rearranged your rolling_sum, but it has been deprecated - you should now use rolling:
df[['Session']].groupby(df['E-Mail']).apply(lambda g: g.rolling(3).sum())

Python Pandas: Get 2 set of random samples per group

I have a pandas DataFrame, say this:
user value
0 a 1
1 a 2
2 a 3
3 a 4
4 a 5
5 b 6
6 b 7
7 b 8
8 b 9
9 b 10
10 c 11
11 c 12
12 c 13
13 c 14
14 c 15
Now I want to group by user and create two mutually exclusive random samples out of it, e.g.
Set1 with 1 sample per group:
user value
3 a 4
9 b 10
13 c 14
Set2 with 2 samples per group:
user value
0 a 1
1 a 2
5 b 6
6 b 7
10 c 11
11 c 12
So far I've tried this:
u = np.array(['a','b','c'])
u = np.repeat(u,5)
df = pd.DataFrame({'user':u,'value':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]})
set1 = df.groupby(['user']).tail(1)
set2 = df.groupby(['user']).head(2)
But these are not random samples, and I would like them to be mutually exclusive. Any ideas?
PS. Each group always has at least 3 elements
You can randomly select 3 records for each user:
a = df.groupby("user")["value"].apply(lambda x: x.sample(3))
a
Out[27]:
user
a 3 4
0 1
2 3
b 5 6
7 8
6 7
c 14 15
10 11
13 14
dtype: int64
And assign the first one to the first set and the remaining two to the second set:
a.groupby(level=0).head(1)
Out[28]:
user
a 3 4
b 5 6
c 14 15
dtype: int64
a.groupby(level=0).tail(2)
Out[29]:
user
a 0 1
2 3
b 7 8
6 7
c 10 11
13 14
dtype: int64
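If you need the two sets as plain DataFrames of the original rows rather than these Series, one way (a sketch; it relies on the inner index level still carrying the original row labels) is:

idx1 = a.groupby(level=0).head(1).index.get_level_values(1)
idx2 = a.groupby(level=0).tail(2).index.get_level_values(1)
set1 = df.loc[idx1]
set2 = df.loc[idx2]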
This may be a bit naive, but all I did was reindex the DataFrame with a random permutation of its length and reset the index. After that I take the head and tail as you did with your original code, and it seems to work. This could probably be made into a function:
a = np.arange(len(df))
np.random.shuffle(a)
df = df.reindex(a).reset_index()
set1 = df.groupby(['user']).tail(1)
>>>
index user value
12 9 b 10
13 10 c 11
14 1 a 2
set2 = df.groupby(['user']).head(2)
>>>
index user value
0 6 b 7
1 2 a 3
2 5 b 6
3 13 c 14
4 3 a 4
6 12 c 13
Hope this helps.
There is likely a better solution but what about just randomizing your data before grouping and then taking the tail and head per group? You could take a set of your indices, take a random permutation of it and use that to create a new scrambled dataframe, then do your current procedure.
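A minimal sketch of that idea on the frame from the question (sample(frac=1) is just one convenient way to shuffle the rows; since every group has at least 3 rows, head(1) and tail(2) cannot overlap):

shuffled = df.sample(frac=1, random_state=0)
set1 = shuffled.groupby('user').head(1)
set2 = shuffled.groupby('user').tail(2)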
