I would like to split my pandas DataFrame into groups and then run a complex function on each chunk. The complex function returns, for each chunk, a DataFrame with an arbitrary number and names of columns and an arbitrary number of rows. I would like those result DataFrames to be combined after the operation. In R I am able to do this with
library(tibble)
library(dplyr)
df = tribble(
  ~g,  ~c1, ~c2,
  "a", 1,   6,
  "a", 2,   7,
  "b", 3,   8,
  "b", 4,   9,
  "b", 5,   10
)
myfct <- function(x, y){
  data.frame(c1 = x,
             c2 = y,
             res = c(x * y, x + y, x / y),
             type = c('mult', 'add', 'div'))
}
df %>% group_by(g) %>% do(myfct(.$c1, .$c2))
with the result being
Source: local data frame [15 x 5]
Groups: g [2]
       g    c1    c2        res   type
   <chr> <dbl> <dbl>      <dbl> <fctr>
1      a     1     6  6.0000000   mult
2      a     2     7 14.0000000    add
3      a     1     6  7.0000000    div
4      a     2     7  9.0000000   mult
5      a     1     6  0.1666667    add
6      a     2     7  0.2857143    div
7      b     3     8 24.0000000   mult
8      b     4     9 36.0000000    add
9      b     5    10 50.0000000    div
10     b     3     8 11.0000000   mult
11     b     4     9 13.0000000    add
12     b     5    10 15.0000000    div
13     b     3     8  0.3750000   mult
14     b     4     9  0.4444444    add
15     b     5    10  0.5000000    div
This - of course - is only an example.
I think you need groupby with apply - check also the "flexible apply" section of the pandas groupby docs:
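First, a pandas version of the example data above (a minimal setup sketch, so the snippet below is self-contained):

import pandas as pd

df = pd.DataFrame({'g':  ['a', 'a', 'b', 'b', 'b'],
                   'c1': [1, 2, 3, 4, 5],
                   'c2': [6, 7, 8, 9, 10]})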
def myfct(x):
    # print(x)  # uncomment to inspect each group as it is passed in
    return pd.DataFrame({'mult': x['c1'] * x['c2'],
                         'add': x['c1'] + x['c2'],
                         'div': x['c1'] / x['c2'],
                         'g': x.name,
                         'c1': x['c1'],
                         'c2': x['c2']})

df = df.groupby('g')[['c1', 'c2']].apply(myfct)
print(df)
add c1 c2 div g mult
0 7 1 6 0.166667 a 6
1 9 2 7 0.285714 a 14
2 11 3 8 0.375000 b 24
3 13 4 9 0.444444 b 36
4 15 5 10 0.500000 b 50
Also, for reshaping it is possible to use melt (again starting from the original df):

df = (df.groupby('g')[['c1', 'c2']]
        .apply(myfct)
        .melt(id_vars=['g', 'c1', 'c2'], value_name='res', var_name='type'))
print(df)
g c1 c2 type res
0 a 1 6 add 7.000000
1 a 2 7 add 9.000000
2 b 3 8 add 11.000000
3 b 4 9 add 13.000000
4 b 5 10 add 15.000000
5 a 1 6 div 0.166667
6 a 2 7 div 0.285714
7 b 3 8 div 0.375000
8 b 4 9 div 0.444444
9 b 5 10 div 0.500000
10 a 1 6 mult 6.000000
11 a 2 7 mult 14.000000
12 b 3 8 mult 24.000000
13 b 4 9 mult 36.000000
14 b 5 10 mult 50.000000
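If you prefer the rows interleaved by group (closer to the R output) rather than blocked by type, a small follow-up sort on the melted frame restores that ordering:

df = df.sort_values(['g', 'c1']).reset_index(drop=True)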
Given the df below, I want to create a column 'C' that holds the maximum of the specific value 15 and column 'A' for rows where the condition B == 't' holds (other rows should just keep the value of 'A').
testdf = pd.DataFrame({"A":[20, 16, 7, 3, 8],"B":['t','t','t','t','f']})
testdf
A B
0 20 t
1 16 t
2 7 t
3 3 t
4 8 f
I tried this:
testdf.loc[testdf['B']=='t', 'C'] = max(15,(testdf.loc[testdf['B']=='t','A']))
And desired output is:
A B C
0 20 t 20
1 16 t 16
2 7 t 15
3 3 t 15
4 8 f 8
Could you help me to get the output? Thank you!
Use np.where with clip:
import numpy as np

testdf['C'] = np.where(testdf['B'].eq('t'),
                       testdf['A'].clip(lower=15), testdf['A'])
Or similarly with series.where:
testdf['C'] = (testdf['A'].clip(lower=15)
                          .where(testdf['B'].eq('t'), testdf['A']))
Output:
A B C
0 20 t 20
1 16 t 16
2 7 t 15
3 3 t 15
4 8 f 8
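(As an aside, the max() call in the original attempt fails: Python's built-in max tries to collapse the whole Series into a single boolean when comparing it with 15, which raises an error. clip and np.maximum are the elementwise equivalents.)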
You could also use the update method:
testdf['C'] = testdf['A']
testdf
A B C
0 20 t 20
1 16 t 16
2 7 t 7
3 3 t 3
4 8 f 8
values = testdf.A[testdf.B.eq('t')].clip(15)
values
Out[16]:
0 20
1 16
2 15
3 15
Name: A, dtype: int64
testdf.update(values.rename('C'))
A B C
0 20 t 20.0
1 16 t 16.0
2 7 t 15.0
3 3 t 15.0
4 8 f 8.0
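Note that update works in place and upcast C to float here; if you need integers back, a quick cast fixes it:

testdf['C'] = testdf['C'].astype(int)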
To apply any formula to individual values in a DataFrame column you can use

df['column'] = df['column'].apply(lambda x: anyFunc(x))

Here x receives the values of the column one at a time; the function can manipulate each value and return the result.
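For instance, a minimal sketch with a stand-in function (any_func and the A_scaled column name are just illustrative):

def any_func(x):
    return x * 10  # whatever per-value computation you need

testdf['A_scaled'] = testdf['A'].apply(any_func)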
I have the following data set:
> dt
a b group
1: 1 5 a
2: 2 6 a
3: 3 7 b
4: 4 8 b
and the following function:

def bigSum(a, b):
    return a.min() + b.max()

I want to apply this function to columns a and b in groupby mode (by group) and assign the result to a new column c of the data frame. My desired result is
> dt
a b group c
1: 1 5 a 7
2: 2 6 a 7
3: 3 7 b 11
4: 4 8 b 11
For instance, if I would have used R data.table, I would do the following:
dt[, c := bigSum(a,b), by = group]
and it would work exactly as I expect. Is there something similar in pandas?
In pandas we have transform:
g = df.groupby('group')
df['out'] = g.a.transform('min') + g.b.transform('max')
df
Out[282]:
a b group out
1 1 5 a 7
2 2 6 a 7
3 3 7 b 11
4 4 8 b 11
Update
df['new'] = df.groupby('group').apply(lambda x: bigSum(x['a'], x['b'])).reindex(df['group']).values
df
Out[287]:
a b group out new
1 1 5 a 7 7
2 2 6 a 7 7
3 3 7 b 11 11
4 4 8 b 11 11
I have a df
df = pd.DataFrame(data={'A': [1, 2, 3, 4, 5, 6, 7, 8],
                        'B': [10, 20, 30, 40, 50, 60, 70, 80]})
A B
0 1 10
1 2 20
2 3 30
3 4 40
4 5 50
5 6 60
6 7 70
7 8 80
which I selected a few rows from.
Then I have a dictionary containing values that should be inserted into column B when the key matches a value in column A of df:
my_dict = {2: 39622884,
           4: 82709546,
           5: 28166511,
           7: 89465652}
When I use the following assignment
df.loc[df['A'].isin(my_dict.keys())]['B'] = list(my_dict.values())
I get the error:
ValueError: Length of values does not match length of index
The desirable output is
A B
0 1 10
1 2 39622884
2 3 30
3 4 82709546
4 5 50
5 6 28166511
6 7 89465652
7 8 80
What is the correct way to implement this procedure?
You can do this with map and fillna:
df['B'] = df['A'].map(my_dict).fillna(df['B'])
Output:
A B
0 1 10.0
1 2 39622884.0
2 3 30.0
3 4 82709546.0
4 5 28166511.0
5 6 60.0
6 7 89465652.0
7 8 80.0
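Column B comes back as float because map yields NaN for the unmatched keys before fillna fills them in; if you want integers, cast at the end:

df['B'] = df['A'].map(my_dict).fillna(df['B']).astype(int)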
I have a sessions DataFrame that contains E-mail and Sessions (int) columns.
I need to calculate a rolling sum of sessions per email (i.e. not globally).
The following works, but it's painfully slow:
emails = set(list(sessions['E-mail']))
ses_sums = []
for em in emails:
email_sessions = sessions[sessions['E-mail'] == em]
email_sessions.is_copy = False
email_sessions['Session_Rolling_Sum'] = pd.rolling_sum(email_sessions['Sessions'], window=self.window).fillna(0)
ses_sums.append(email_sessions)
df = pd.concat(ses_sums, ignore_index=True)
Is there a way of achieving the same in pandas, but using pandas operators on a dataframe instead of creating separate dataframes for each email and then concatenating them?
(either that or some other way of making this faster)
Setup
np.random.seed([3,1415])
df = pd.DataFrame({'E-Mail': np.random.choice(list('AB'), 20),
                   'Session': np.random.randint(1, 10, 20)})
Solution
The current and proper way to do this is with rolling(...).sum(), which can be used on the result of a pd.Series groupby object.
# Series Group By
# /------------------------\
df.groupby('E-Mail').Session.rolling(3).sum()
# \--------------/
# Method you want
E-Mail
A 0 NaN
2 NaN
4 11.0
5 7.0
7 10.0
12 16.0
15 16.0
17 16.0
18 17.0
19 18.0
B 1 NaN
3 NaN
6 18.0
8 14.0
9 16.0
10 12.0
11 13.0
13 16.0
14 20.0
16 22.0
Name: Session, dtype: float64
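If you want this back on the original frame as the Session_Rolling_Sum column from the question, one way (a sketch, using the column names from the setup above) is to drop the group level so the index aligns again:

df['Session_Rolling_Sum'] = (df.groupby('E-Mail').Session
                               .rolling(3).sum()
                               .reset_index(level=0, drop=True)
                               .fillna(0))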
Details
df
E-Mail Session
0 A 9
1 B 7
2 A 1
3 B 3
4 A 1
5 A 5
6 B 8
7 A 4
8 B 3
9 B 5
10 B 4
11 B 4
12 A 7
13 B 8
14 B 8
15 A 5
16 B 6
17 A 4
18 A 8
19 A 6
Say you start with
In [58]: df = pd.DataFrame({'E-Mail': ['foo'] * 3 + ['bar'] * 3 + ['foo'] * 3, 'Session': range(9)})
In [59]: df
Out[59]:
E-Mail Session
0 foo 0
1 foo 1
2 foo 2
3 bar 3
4 bar 4
5 bar 5
6 foo 6
7 foo 7
8 foo 8
In [60]: df[['Session']].groupby(df['E-Mail']).apply(pd.rolling_sum, 3)
Out[60]:
Session
E-Mail
bar 3 NaN
4 NaN
5 12.0
foo 0 NaN
1 NaN
2 3.0
6 9.0
7 15.0
8 21.0
Incidentally, note that I just rearranged your rolling_sum call, but it has been deprecated (and removed in recent pandas versions) - you should now use rolling:
df[['Session']].groupby(df['E-Mail']).apply(lambda g: g.rolling(3).sum())
I have a pandas DataFrame say this:
user value
0 a 1
1 a 2
2 a 3
3 a 4
4 a 5
5 b 6
6 b 7
7 b 8
8 b 9
9 b 10
10 c 11
11 c 12
12 c 13
13 c 14
14 c 15
Now I want to group by user and create two mutually exclusive random samples out of it, e.g.
Set1 with 1 sample per group:
user value
3 a 4
9 b 10
13 c 14
Set2 with 2 samples per group:
user value
0 a 1
1 a 2
5 b 6
6 b 7
10 c 11
11 c 12
So far I've tried this:
u = np.array(['a','b','c'])
u = np.repeat(u,5)
df = pd.DataFrame({'user':u,'value':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]})
set1 = df.groupby(['user']).tail(1)
set2 = df.groupby(['user']).head(2)
But these are not random samples, and I would like them to be mutually exclusive. Any ideas?
PS: Each group always has at least 3 elements.
You can randomly select 3 records for each user:
a = df.groupby("user")["value"].apply(lambda x: x.sample(3))
a
Out[27]:
user
a 3 4
0 1
2 3
b 5 6
7 8
6 7
c 14 15
10 11
13 14
dtype: int64
Then assign the first one to the first set and the remaining two to the second set:
a.groupby(level=0).head(1)
Out[28]:
user
a 3 4
b 5 6
c 14 15
dtype: int64
a.groupby(level=0).tail(2)
Out[29]:
user
a 0 1
2 3
b 7 8
6 7
c 10 11
13 14
dtype: int64
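If you need the full rows rather than just the value column, a sketch along the same lines is to sample whole group frames (group_keys=False keeps the original index):

sampled = df.groupby('user', group_keys=False).apply(lambda g: g.sample(3))
set1 = sampled.groupby('user').head(1)
set2 = sampled.groupby('user').tail(2)

With pandas 1.1+ the sampling step can also be written as df.groupby('user').sample(3).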
This may be a bit naive, but all I did was reindex the DataFrame with a random permutation of its length and reset the index. After that I take the head and tail as you did in your original code, and it seems to work. This could probably be made into a function:
a = np.arange(len(df))
np.random.shuffle(a)
df = df.reindex(a).reset_index()
set1 = df.groupby(['user']).tail(1)
>>>
index user value
12 9 b 10
13 10 c 11
14 1 a 2
set2 = df.groupby(['user']).head(2)
>>>
index user value
0 6 b 7
1 2 a 3
2 5 b 6
3 13 c 14
4 3 a 4
6 12 c 13
Hope this helps.
There is likely a better solution, but what about just randomizing your data before grouping and then taking the tail and head per group? You could take your indices, make a random permutation of them, use that to create a new scrambled DataFrame, and then do your current procedure, as in the sketch below.
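A minimal sketch of that idea (sample(frac=1) does the shuffling; random_state is only for reproducibility). Since each group has at least 3 rows, head(2) and tail(1) can never pick the same row:

shuffled = df.sample(frac=1, random_state=0)  # random permutation of the rows
set1 = shuffled.groupby('user').tail(1)       # 1 random row per group
set2 = shuffled.groupby('user').head(2)       # 2 different random rows per group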