groupby/eq compute mean of specific column - python

I'm trying to figure out how to use groupby/eq to compute the mean of a specific column. I have a df as seen below (original df).
I would like to group by 'Group' and 'players', taking only the rows where class equals 1, and get the mean of 'score'.
Example, for Group = a, players = 2:
(16 + 13 + 19) / 3 = 16
+-------+---------+-------+-------+------------+
| Group | players | class | score | score_mean |
+-------+---------+-------+-------+------------+
| a     | 2       | 2     | 14    |            |
| a     | 2       | 1     | 16    | 16         |
| a     | 2       | 1     | 13    | 16         |
| a     | 2       | 2     | 13    |            |
| a     | 2       | 1     | 19    | 16         |
| a     | 2       | 2     | 17    |            |
| a     | 2       | 2     | 14    |            |
+-------+---------+-------+-------+------------+
I've tried:
df['score_mean'] = df['class'].eq(1).groupby(['Group', 'players'])['score'].transform('mean')
but I kept getting a KeyError: df['class'].eq(1) returns a boolean Series, which has no 'Group' or 'players' columns to group by.
original df:
+----+-------+---------+-------+-------+
|    | Group | players | class | score |
+----+-------+---------+-------+-------+
| 0  | a     | 1       | 1     | 10    |
| 1  | c     | 2       | 1     | 20    |
| 2  | a     | 1       | 3     | 29    |
| 3  | c     | 1       | 3     | 22    |
| 4  | a     | 2       | 2     | 14    |
| 5  | b     | 1       | 2     | 16    |
| 6  | a     | 2       | 1     | 16    |
| 7  | b     | 2       | 3     | 17    |
| 8  | c     | 1       | 2     | 22    |
| 9  | b     | 1       | 2     | 23    |
| 10 | c     | 2       | 2     | 22    |
| 11 | d     | 1       | 1     | 13    |
| 12 | a     | 2       | 1     | 13    |
| 13 | d     | 1       | 3     | 23    |
| 14 | a     | 2       | 2     | 13    |
| 15 | d     | 2       | 1     | 34    |
| 16 | b     | 1       | 3     | 32    |
| 17 | c     | 2       | 2     | 29    |
| 18 | b     | 2       | 2     | 28    |
| 19 | a     | 2       | 1     | 19    |
| 20 | a     | 1       | 1     | 19    |
| 21 | c     | 1       | 1     | 27    |
| 22 | b     | 1       | 3     | 47    |
| 23 | a     | 2       | 2     | 17    |
| 24 | c     | 1       | 1     | 14    |
| 25 | c     | 2       | 2     | 25    |
| 26 | a     | 1       | 3     | 67    |
| 27 | b     | 2       | 3     | 21    |
| 28 | a     | 1       | 3     | 27    |
| 29 | c     | 1       | 1     | 16    |
| 30 | a     | 2       | 2     | 14    |
| 31 | b     | 1       | 2     | 25    |
+----+-------+---------+-------+-------+
import pandas as pd

data = {'Group':['a','c','a','c','a','b','a','b','c','b','c','d','a','d','a','d',
                 'b','c','b','a','a','c','b','a','c','c','a','b','a','c','a','b'],
        'players':[1,2,1,1,2,1,2,2,1,1,2,1,2,1,2,2,1,2,2,2,1,1,1,2,1,2,1,2,1,1,2,1],
        'class':[1,1,3,3,2,2,1,3,2,2,2,1,1,3,2,1,3,2,2,1,1,1,3,2,1,2,3,3,3,1,2,2],
        'score':[10,20,29,22,14,16,16,17,22,23,22,13,13,23,13,34,32,29,28,19,19,27,47,17,14,25,67,21,27,16,14,25]}
df = pd.DataFrame(data)
Kindly advise.
Many thanks & regards

Try via the set_index(), groupby(), assign() and reset_index() methods:
df = (df.set_index(['Group', 'players'])
        .assign(score_mean=df[df['class'].eq(1)]
                              .groupby(['Group', 'players'])['score'].mean())
        .reset_index())
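The assign works because the filtered groupby().mean() returns a Series indexed by (Group, players), which aligns with the MultiIndex created by set_index. A minimal sketch of that intermediate Series (values computed from the question's df):
means = df[df['class'].eq(1)].groupby(['Group', 'players'])['score'].mean()
print(means)
# Group  players
# a      1          14.5
#        2          16.0
# c      1          19.0
#        2          20.0
# d      1          13.0
#        2          34.0
# Name: score, dtype: float64
Groups with no class-1 rows (all of 'b') are absent from this Series, so their score_mean comes out as NaN after the assign.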
Update:
If you want the first df in the question (with the score_mean column) as your output, then use:
grouped = df.groupby(['Group', 'players', 'class']).transform('mean')
grouped = (grouped.assign(players=df['players'], Group=df['Group'], Class=df['class'])
                  .where(df['Group'] == 'a')
                  .dropna())
grouped['score'] = grouped.apply(lambda x: float('NaN') if x['players'] == 1 else x['score'], axis=1)
grouped = grouped.dropna(subset=['score'])
Now if you print grouped, you will get your desired output.

If I got you right, you need values returned only where class equals 1. I'm not sure what that will serve, but the code is below. Use groupby transform and chain where:
df['score_mean'] = df.groupby(['Group', 'players'])['score'].transform('mean').where(df['class'] == 1).fillna('')
    Group  players  class  score score_mean
0       a        1      1     10         10
1       a        2      1     20         20
2       a        3      5     29
3       a        4      5     22
4       a        5      5     14
5       b        1      7     16
6       b        2      7     16
7       b        3      7     17
8       c        1      4     22
9       c        2      2     23
10      c        3      2     22
11      d        1      4     13
12      d        2      4     13
13      d        3      3     23
14      d        4      8     13
15      d        5      7     34
16      e        1      7     32
17      e        2      2     29
18      e        3      2     28
19      e        4      1     19         19
20      e        5      1     19         19
21      e        6      1     27         27
22      f        1      5     47
23      f        2      5     17
24      f        3      7     14
25      f        4      7     25
26      g        1      3     67
27      g        2      3     21
28      g        3      3     27
29      g        4      8     16
30      g        5      8     14
31      g        6      8     25

You could first filter by class and then create score_mean by doing a groupby and transform.
(
    df[df['class'] == 1]
    .assign(score_mean=lambda x: x.groupby(['Group', 'players']).score.transform('mean'))
)
    Group  players  class  score score_mean
0       a        1      1     10       14.5
1       c        2      1     20       20.0
6       a        2      1     16       16.0
11      d        1      1     13       13.0
12      a        2      1     13       16.0
15      d        2      1     34       34.0
19      a        2      1     19       16.0
20      a        1      1     19       14.5
21      c        1      1     27       19.0
24      c        1      1     14       19.0
29      c        1      1     16       19.0
If you want to keep the other classes and set their mean to '', you can do:
(
    df[df['class'] == 1]
    .groupby(['Group', 'players']).score.transform('mean')
    .pipe(lambda x: df.assign(score_mean=x))
    .fillna('')
)
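Note that fillna('') mixes strings into a numeric column, leaving score_mean with object dtype. If you would rather keep the column numeric, a sketch that simply leaves the missing means as NaN (same filter-then-transform idea as above):
df.assign(score_mean=df[df['class'] == 1]
                       .groupby(['Group', 'players'])['score']
                       .transform('mean'))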

Related

determine chain of predecessors and successor from a list of first predecessor in python

I have a list like the following
+----+-------------------+
| id | first_predecessor |
+----+-------------------+
| 0  | 4                 |
| 1  | 5                 |
| 2  | 6                 |
| 3  | 17,18             |
| 4  | 7                 |
| 5  | 8                 |
| 6  | 9                 |
| 7  | 10,11,12          |
| 8  | 13,14,15          |
| 9  | 16                |
| 10 | Input             |
| 11 | Input             |
| 12 | Input             |
| 13 | Input             |
| 14 | Input             |
| 15 | Input             |
| 16 | Input             |
| 17 | 19                |
| 18 | 20                |
| 19 | 21                |
| 20 | 22                |
| 21 | Input             |
+----+-------------------+
One item can have multiple immediate incoming ids, as in the case of id=3, which is immediately preceded by id=17 and id=18.
I need Python code to determine this result by following the chain of predecessors both ways
(it is best to read the all_successors column to understand the logic; all_predecessors is the same logic backwards):
+----+-------------------+------------------+----------------+
| id | first_predecessor | all_predecessors | all_successors |
+----+-------------------+------------------+----------------+
| 0  | 4                 | 4,7,10,11,12     |                |
| 1  | 5                 | 5,8,13,14,15     |                |
| 2  | 6                 | 6,9,16           |                |
| 3  | 17,18             | 19,21,20,22      |                |
| 4  | 7                 | 7,10,11,12       | 0              |
| 5  | 8                 | 8,13,14,15       | 1              |
| 6  | 9                 | 9,16             | 2              |
| 7  | 10,11,12          | 10,11,12         | 0,4            |
| 8  | 13,14,15          | 13,14,15         | 1,5            |
| 9  | 16                | 16               | 2,6            |
| 10 | Input             |                  | 0,4,7          |
| 11 | Input             |                  | 0,4,7          |
| 12 | Input             |                  | 0,4,7          |
| 13 | Input             |                  | 1,5,8          |
| 14 | Input             |                  | 1,5,8          |
| 15 | Input             |                  | 1,5,8          |
| 16 | Input             |                  | 2,6,9          |
| 17 | 19                | 19,21            | 3              |
| 18 | 20                | 20,22            | 3              |
| 19 | 21                | 21               | 3,17           |
| 20 | 22                | 22               | 3,18           |
| 21 | Input             |                  | 3,17,19        |
| 22 | Input             |                  | 3,18,20        |
+----+-------------------+------------------+----------------+
I need some kind of recursive solution, or should I use some graph package?
You can use the following networkx functions to find all predecessors and all successors:
ancestors(G, source): Returns all nodes having a path to source in G.
descendants(G, source): Returns all nodes reachable from source in G.
To run the following example, make sure you change 'Input' in your first_predecessor column to NaN.
import networkx as nx

df_ = df.copy()
df_['first_predecessor'] = df_['first_predecessor'].str.split(',')
df_ = df_.explode('first_predecessor')
df_['first_predecessor'] = df_['first_predecessor'].fillna(-1).astype(int)
G = nx.from_pandas_edgelist(df_, 'first_predecessor', 'id', create_using=nx.DiGraph())
G.remove_node(-1)
df['all_predecessors'] = df['id'].apply(lambda x: ','.join(map(str, sorted(nx.ancestors(G, x)))))
df['all_successors'] = df['id'].apply(lambda x: ','.join(map(str, sorted(nx.descendants(G, x)))))
print(df)
    id  first_predecessor   all_predecessors  all_successors
0    0                  4       4,7,10,11,12
1    1                  5       5,8,13,14,15
2    2                  6             6,9,16
3    3              17,18  17,18,19,20,21,22
4    4                  7         7,10,11,12               0
5    5                  8         8,13,14,15               1
6    6                  9               9,16               2
7    7           10,11,12           10,11,12             0,4
8    8           13,14,15           13,14,15             1,5
9    9                 16                 16             2,6
10  10                NaN                              0,4,7
11  11                NaN                              0,4,7
12  12                NaN                              0,4,7
13  13                NaN                              1,5,8
14  14                NaN                              1,5,8
15  15                NaN                              1,5,8
16  16                NaN                              2,6,9
17  17                 19              19,21               3
18  18                 20              20,22               3
19  19                 21                 21            3,17
20  20                 22                 22            3,18
21  21                NaN                            3,17,19
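If you would rather avoid a graph package, here is a minimal pure-Python sketch of the same traversal, assuming df still holds the string 'Input' in first_predecessor (the parents/children maps and the closure helper are illustrative names, not part of the networkx answer above):
from collections import defaultdict

parents = defaultdict(list)   # id -> immediate predecessors
children = defaultdict(list)  # id -> immediate successors
for _, row in df.iterrows():
    if row['first_predecessor'] != 'Input':
        for p in str(row['first_predecessor']).split(','):
            parents[row['id']].append(int(p))
            children[int(p)].append(row['id'])

def closure(node, edges):
    # all nodes reachable from node by repeatedly following edges
    seen, stack = set(), list(edges[node])
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(edges[n])
    return sorted(seen)

df['all_predecessors'] = df['id'].apply(lambda x: ','.join(map(str, closure(x, parents))))
df['all_successors'] = df['id'].apply(lambda x: ','.join(map(str, closure(x, children))))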

How to count the occurrence of a value and set that count as a new value for that value's row

Title is probably confusing, but let me make it clearer.
Let's say I have a df like this:
+----+------+---------------+
| Id | Name | reports_to_id |
+----+------+---------------+
| 0  | A    | 10            |
| 1  | B    | 10            |
| 2  | C    | 11            |
| 3  | D    | 12            |
| 4  | E    | 11            |
| 10 | F    | 20            |
| 11 | G    | 21            |
| 12 | H    | 22            |
+----+------+---------------+
I would want my resulting df to look like this:
+----+------+---------------+-------+
| Id | Name | reports_to_id | Count |
+----+------+---------------+-------+
| 0  | A    | 10            | 0     |
| 1  | B    | 10            | 0     |
| 2  | C    | 11            | 0     |
| 3  | D    | 12            | 0     |
| 4  | E    | 11            | 0     |
| 10 | F    | 20            | 2     |
| 11 | G    | 21            | 2     |
| 12 | H    | 22            | 1     |
+----+------+---------------+-------+
But this is what I currently get as a result of my code (which is wrong):
+----+------+---------------+-------+
| Id | Name | reports_to_id | Count |
+----+------+---------------+-------+
| 0  | A    | 10            | 2     |
| 1  | B    | 10            | 2     |
| 2  | C    | 11            | 2     |
| 3  | D    | 12            | 1     |
| 4  | E    | 11            | 2     |
| 10 | F    | 20            | 0     |
| 11 | G    | 21            | 0     |
| 12 | H    | 22            | 0     |
+----+------+---------------+-------+
with this code:
df['COUNT'] = df.groupby(['reports_to_id'])['id'].transform('count')
Any suggestions or directions on how to get the result I want? All help is appreciated, and thank you in advance!
Use value_counts to count the reports_to_id values, then map that onto Id:
df['COUNT'] = df['Id'].map(df['reports_to_id'].value_counts()).fillna(0)
Output:
   Id Name  reports_to_id  COUNT
0   0    A             10    0.0
1   1    B             10    0.0
2   2    C             11    0.0
3   3    D             12    0.0
4   4    E             11    0.0
5  10    F             20    2.0
6  11    G             21    2.0
7  12    H             22    1.0
Similar idea with reindex:
df['COUNT'] = df['reports_to_id'].value_counts().reindex(df['Id'], fill_value=0).values
which gives a better looking COUNT:
   Id Name  reports_to_id  COUNT
0   0    A             10      0
1   1    B             10      0
2   2    C             11      0
3   3    D             12      0
4   4    E             11      0
5  10    F             20      2
6  11    G             21      2
7  12    H             22      1
You can try the following:
l = list(df['reports_to_id'])
df['Count'] = df['Id'].apply(lambda x: l.count(x))  # note: list.count inside apply is O(n^2); fine only for small frames
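For completeness, a sketch that combines the map idea with an integer cast, so the count stays an int instead of the float produced by fillna (same df as above):
df['Count'] = df['Id'].map(df['reports_to_id'].value_counts()).fillna(0).astype(int)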

Find top N values within each group

I have a dataset similar to the sample below:
| id | size   | old_a | old_b | new_a | new_b |
|----|--------|-------|-------|-------|-------|
| 6  | small  | 3     | 0     | 21    | 0     |
| 6  | small  | 9     | 0     | 23    | 0     |
| 13 | medium | 3     | 0     | 12    | 0     |
| 13 | medium | 37    | 0     | 20    | 1     |
| 20 | medium | 30    | 0     | 5     | 6     |
| 20 | medium | 12    | 2     | 3     | 0     |
| 12 | small  | 7     | 0     | 2     | 0     |
| 10 | small  | 8     | 0     | 12    | 0     |
| 15 | small  | 19    | 0     | 3     | 0     |
| 15 | small  | 54    | 0     | 8     | 0     |
| 87 | medium | 6     | 0     | 9     | 0     |
| 90 | medium | 11    | 1     | 16    | 0     |
| 90 | medium | 25    | 0     | 4     | 0     |
| 90 | medium | 10    | 0     | 5     | 0     |
| 9  | large  | 8     | 1     | 23    | 0     |
| 9  | large  | 19    | 0     | 2     | 0     |
| 1  | large  | 1     | 0     | 0     | 0     |
| 50 | large  | 34    | 0     | 7     | 0     |
This is the input for the above table:
data=[[6,'small',3,0,21,0],[6,'small',9,0,23,0],[13,'medium',3,0,12,0],[13,'medium',37,0,20,1],[20,'medium',30,0,5,6],[20,'medium',12,2,3,0],[12,'small',7,0,2,0],[10,'small',8,0,12,0],[15,'small',19,0,3,0],[15,'small',54,0,8,0],[87,'medium',6,0,9,0],[90,'medium',11,1,16,0],[90,'medium',25,0,4,0],[90,'medium',10,0,5,0],[9,'large',8,1,23,0],[9,'large',19,0,2,0],[1,'large',1,0,0,0],[50,'large',34,0,7,0]]
data= pd.DataFrame(data,columns=['id','size','old_a','old_b','new_a','new_b'])
I want output that groups the dataset on size and lists the top 2 ids based on the values of the 'new_a' column within each size group. Since some ids repeat multiple times, I want to sum the values of new_a for such ids first and then find the top 2 values. My final table should look like the one below:
| size   | id | new_a |
|--------|----|-------|
| large  | 9  | 25    |
| large  | 50 | 7     |
| medium | 13 | 32    |
| medium | 90 | 25    |
| small  | 6  | 44    |
| small  | 10 | 12    |
I have tried the code below, but it isn't showing the top 2 values of new_a for each group within the 'size' column:
nlargest = data.groupby(['size','id'])['new_a'].sum().nlargest(2).reset_index()
print(
    data.groupby('size').apply(
        # select the numeric column before summing so this also works on newer pandas
        lambda x: x.groupby('id')[['new_a']].sum().nlargest(2, columns='new_a')
    ).reset_index()[['size', 'id', 'new_a']]
)
Prints:
     size  id  new_a
0   large   9     25
1   large  50      7
2  medium  13     32
3  medium  90     25
4   small   6     44
5   small  10     12
You can set size, id as the index to avoid a double groupby here, and use Series.sum leveraging the level parameter (note that sum(level=...) is deprecated in newer pandas in favour of groupby(level=...).sum()):
data.set_index(['size', 'id'])['new_a'].groupby(level=0).apply(
    lambda x: x.sum(level=1).nlargest(2)  # select 'new_a' first so nlargest works on a Series
).reset_index()
     size  id  new_a
0   large   9     25
1   large  50      7
2  medium  13     32
3  medium  90     25
4   small   6     44
5   small  10     12
You can chain two groupby methods:
(data.groupby(['id', 'size'])['new_a'].sum()
     .groupby('size').nlargest(2)
     .droplevel(0).to_frame('new_a').reset_index())
Output:
   id    size  new_a
0   9   large     25
1  50   large      7
2  13  medium     32
3  90  medium     25
4   6   small     44
5  10   small     12
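An equivalent sketch without nlargest, using sort_values plus groupby().head(2) on the same data (the column order differs slightly from the outputs above):
(data.groupby(['size', 'id'], as_index=False)['new_a'].sum()
     .sort_values(['size', 'new_a'], ascending=[True, False])
     .groupby('size').head(2)
     .reset_index(drop=True))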

Pandas: sum multiple columns based on similar consecutive numbers in another column

Given the following table
+----+--------+--------+--------------+
| Nr | Price  | Volume | Transactions |
+----+--------+--------+--------------+
| 1  | 194.6  | 100    | 1            |
| 2  | 195    | 10     | 1            |
| 3  | 194.92 | 100    | 1            |
| 4  | 194.92 | 52     | 1            |
| 5  | 194.9  | 99     | 1            |
| 6  | 194.86 | 74     | 1            |
| 7  | 194.85 | 900    | 1            |
| 8  | 194.85 | 25     | 1            |
| 9  | 194.85 | 224    | 1            |
| 10 | 194.6  | 101    | 1            |
| 11 | 194.85 | 19     | 1            |
| 12 | 194.6  | 10     | 1            |
| 13 | 194.6  | 25     | 1            |
| 14 | 194.53 | 12     | 1            |
| 15 | 194.85 | 14     | 1            |
| 16 | 194.6  | 11     | 1            |
| 17 | 194.85 | 93     | 1            |
| 18 | 195    | 90     | 1            |
| 19 | 195    | 100    | 1            |
| 20 | 195    | 50     | 1            |
| 21 | 195    | 50     | 1            |
| 22 | 195    | 25     | 1            |
| 23 | 195    | 5      | 1            |
| 24 | 195    | 500    | 1            |
| 25 | 195    | 100    | 1            |
| 26 | 195.09 | 100    | 1            |
| 27 | 195    | 120    | 1            |
| 28 | 195    | 60     | 1            |
| 29 | 195    | 40     | 1            |
| 30 | 195    | 10     | 1            |
| 31 | 194.6  | 1      | 1            |
| 32 | 194.99 | 1      | 1            |
| 33 | 194.81 | 20     | 1            |
| 34 | 194.81 | 50     | 1            |
| 35 | 194.97 | 17     | 1            |
| 36 | 194.99 | 25     | 1            |
| 37 | 195    | 75     | 1            |
+----+--------+--------+--------------+
For faster testing, you can also find the same table as a pandas DataFrame below:
pd_data_before = pd.DataFrame([[1,194.6,100,1],[2,195,10,1],[3,194.92,100,1],[4,194.92,52,1],[5,194.9,99,1],[6,194.86,74,1],[7,194.85,900,1],[8,194.85,25,1],[9,194.85,224,1],[10,194.6,101,1],[11,194.85,19,1],[12,194.6,10,1],[13,194.6,25,1],[14,194.53,12,1],[15,194.85,14,1],[16,194.6,11,1],[17,194.85,93,1],[18,195,90,1],[19,195,100,1],[20,195,50,1],[21,195,50,1],[22,195,25,1],[23,195,5,1],[24,195,500,1],[25,195,100,1],[26,195.09,100,1],[27,195,120,1],[28,195,60,1],[29,195,40,1],[30,195,10,1],[31,194.6,1,1],[32,194.99,1,1],[33,194.81,20,1],[34,194.81,50,1],[35,194.97,17,1],[36,194.99,25,1],[37,195,75,1]],columns=['Nr','Price','Volume','Transactions'])
The question is: how do we sum up the volume and transactions over runs of equal consecutive prices? The end result would be something like this:
+----+--------+--------+--------------+
| Nr | Price  | Volume | Transactions |
+----+--------+--------+--------------+
| 1  | 194.6  | 100    | 1            |
| 2  | 195    | 10     | 1            |
| 4  | 194.92 | 152    | 2            |
| 5  | 194.9  | 99     | 1            |
| 6  | 194.86 | 74     | 1            |
| 9  | 194.85 | 1149   | 3            |
| 10 | 194.6  | 101    | 1            |
| 11 | 194.85 | 19     | 1            |
| 13 | 194.6  | 35     | 2            |
| 14 | 194.53 | 12     | 1            |
| 15 | 194.85 | 14     | 1            |
| 16 | 194.6  | 11     | 1            |
| 17 | 194.85 | 93     | 1            |
| 25 | 195    | 920    | 8            |
| 26 | 195.09 | 100    | 1            |
| 30 | 195    | 230    | 4            |
| 31 | 194.6  | 1      | 1            |
| 32 | 194.99 | 1      | 1            |
| 34 | 194.81 | 70     | 2            |
| 35 | 194.97 | 17     | 1            |
| 36 | 194.99 | 25     | 1            |
| 37 | 195    | 75     | 1            |
+----+--------+--------+--------------+
You can also find the result ready-made in a pandas DataFrame below:
pd_data_after = pd.DataFrame([[1,194.6,100,1],[2,195,10,1],[4,194.92,152,2],[5,194.9,99,1],[6,194.86,74,1],[9,194.85,1149,3],[10,194.6,101,1],[11,194.85,19,1],[13,194.6,35,2],[14,194.53,12,1],[15,194.85,14,1],[16,194.6,11,1],[17,194.85,93,1],[25,195,920,8],[26,195.09,100,1],[30,195,230,4],[31,194.6,1,1],[32,194.99,1,1],[34,194.81,70,2],[35,194.97,17,1],[36,194.99,25,1],[37,195,75,1]],columns=['Nr','Price','Volume','Transactions'])
I managed to achieve this with a for loop, but it is very slow because it iterates over each row, and my data set is huge: around 50 million rows.
Is there any way to achieve this without looping?
A common trick to groupby consecutive values is the following:
df.col.ne(df.col.shift()).cumsum()
We can use that here, then use agg to keep the first values of the columns we aren't summing, and to sum the values we do want to sum.
(df.groupby(df.Price.ne(df.Price.shift()).cumsum())
.agg({'Nr': 'last', 'Price': 'first', 'Volume':'sum', 'Transactions': 'sum'})
).reset_index(drop=True)
    Nr   Price  Volume  Transactions
0    1  194.60     100             1
1    2  195.00      10             1
2    4  194.92     152             2
3    5  194.90      99             1
4    6  194.86      74             1
5    9  194.85    1149             3
6   10  194.60     101             1
7   11  194.85      19             1
8   13  194.60      35             2
9   14  194.53      12             1
10  15  194.85      14             1
11  16  194.60      11             1
12  17  194.85      93             1
13  25  195.00     920             8
14  26  195.09     100             1
15  30  195.00     230             4
16  31  194.60       1             1
17  32  194.99       1             1
18  34  194.81      70             2
19  35  194.97      17             1
20  36  194.99      25             1
21  37  195.00      75             1
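To see what the grouper actually looks like, here is the intermediate Series on the sample data (a small illustration, not part of the pipeline above):
grouper = pd_data_before['Price'].ne(pd_data_before['Price'].shift()).cumsum()
print(grouper.head(6).tolist())  # [1, 2, 3, 3, 4, 5] -- Nr 3 and 4 share the 194.92 run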

Python: Running maximum by another column?

I have a dataframe like this, which tracks the value of certain items (ids) over time:
import numpy as np
import pandas as pd

mytime = np.tile(np.arange(0, 10), 2)
myids = np.repeat([123, 456], [10, 10])
myvalues = np.random.randint(20, 31, 10 * 2)  # random_integers is deprecated; randint's upper bound is exclusive
df = pd.DataFrame()
df['myids'] = myids
df['mytime'] = mytime
df['myvalues'] = myvalues
+-------+--------+----------+
| myids | mytime | myvalues |
+-------+--------+----------+
| 123   | 0      | 29       |
| 123   | 1      | 23       |
| 123   | 2      | 26       |
| 123   | 3      | 24       |
| 123   | 4      | 25       |
| 123   | 5      | 29       |
| 123   | 6      | 28       |
| 123   | 7      | 21       |
| 123   | 8      | 20       |
| 123   | 9      | 26       |
| 456   | 0      | 26       |
| 456   | 1      | 24       |
| 456   | 2      | 20       |
| 456   | 3      | 26       |
| 456   | 4      | 29       |
| 456   | 5      | 29       |
| 456   | 6      | 24       |
| 456   | 7      | 21       |
| 456   | 8      | 27       |
| 456   | 9      | 29       |
+-------+--------+----------+
I'd need to calculate the running maximum for each id.
np.maximum.accumulate()
would calculate the running maximum regardless of id, whereas I need a similar calculation that resets every time the id changes. I can think of a simple script to do it in numba (I have very large arrays, and non-vectorised, non-numba code would be slow), but is there an easier way to do it?
With just two values I can run:
df['running max'] = np.hstack((np.maximum.accumulate(df[df['myids'] == 123]['myvalues']),
                               np.maximum.accumulate(df[df['myids'] == 456]['myvalues'])))
but this is not feasible with lots and lots of values.
Thanks!
Here you go. The assumption is that mytime is sorted.
mytime = np.tile(np.arange(0, 10), 2)
myids = np.repeat([123, 456], [10, 10])
myvalues = np.random.randint(20, 31, 10 * 2)
df = pd.DataFrame()
df['myids'] = myids
df['mytime'] = mytime
df['myvalues'] = myvalues

groups = df.groupby('myids')
df['run_max_group'] = groups['myvalues'].transform(np.maximum.accumulate)
Output...
    myids  mytime  myvalues  run_max_group
0     123       0        27             27
1     123       1        21             27
2     123       2        24             27
3     123       3        25             27
4     123       4        22             27
5     123       5        20             27
6     123       6        20             27
7     123       7        30             30
8     123       8        24             30
9     123       9        22             30
10    456       0        29             29
11    456       1        23             29
12    456       2        30             30
13    456       3        28             30
14    456       4        26             30
15    456       5        25             30
16    456       6        28             30
17    456       7        27             30
18    456       8        20             30
19    456       9        24             30
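Equivalently, pandas has a built-in cumulative maximum, so you can skip the NumPy ufunc (same assumption that mytime is sorted):
df['run_max_group'] = df.groupby('myids')['myvalues'].cummax()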
It seems that it is indeed not too difficult:
byid = df.groupby('myids')
rmax = byid['myvalues'].cummax()
for k, indices in byid.indices.items():
    print('myids = %s' % k)
    print('running max = %s' % rmax[indices])
I have (almost) no previous pandas experience, but using IPython as an exploratory instrument I was able to find a solution. I recommend using IPython to explore large and complex libraries.
P.S. Re my previous comment: no need for axis=.
