I have a df where I'd like the cumsum to be bounded within a range of 0 to 6, where any sum over 6 rolls over to 0. The adj_cumsum column is what I'm trying to get. I've searched and found a couple of posts using loops, but since mine is more straightforward, I'm wondering whether there is a less complicated or more current approach.
+----+-------+------+----------+----------------+--------+------------+
| | month | days | adj_days | adj_days_shift | cumsum | adj_cumsum |
+----+-------+------+----------+----------------+--------+------------+
| 0 | jan | 31 | 3 | 0 | 0 | 0 |
| 1 | feb | 28 | 0 | 3 | 3 | 3 |
| 2 | mar | 31 | 3 | 0 | 3 | 3 |
| 3 | apr | 30 | 2 | 3 | 6 | 6 |
| 4 | may | 31 | 3 | 2 | 8 | 1 |
| 5 | jun | 30 | 2 | 3 | 11 | 4 |
| 6 | jul | 31 | 3 | 2 | 13 | 6 |
| 7 | aug | 31 | 3 | 3 | 16 | 2 |
| 8 | sep | 30 | 2 | 3 | 19 | 5 |
| 9 | oct | 31 | 3 | 2 | 21 | 0 |
| 10 | nov | 30 | 2 | 3 | 24 | 3 |
| 11 | dec | 31 | 3 | 2 | 26 | 5 |
+----+-------+------+----------+----------------+--------+------------+
import pandas as pd

data = {"month": ['jan','feb','mar','apr',
'may','jun','jul','aug',
'sep','oct','nov','dec'],
"days": [31,28,31,30,31,30,31,31,30,31,30,31]}
df = pd.DataFrame(data)
df['adj_days'] = df['days'] - 28
df['adj_days_shift'] = df['adj_days'].shift(1)
df['cumsum'] = df.adj_days_shift.cumsum()
df.fillna(0, inplace=True)
Kindly advise
What you are looking for is called a modulo operation.
Use df['adj_cumsum'] = df['cumsum'].mod(7).
Intuition: taking the remainder after dividing by 7 wraps any running total back into the 0–6 range:
df["adj_cumsum"] = df["cumsum"].apply(lambda x:x%7)
Am I right?
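Putting the question's code and the answer together, a minimal runnable sketch (assuming only that the 0–6 bound means modulo 7):

```python
import pandas as pd

data = {"month": ['jan', 'feb', 'mar', 'apr', 'may', 'jun',
                  'jul', 'aug', 'sep', 'oct', 'nov', 'dec'],
        "days": [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]}
df = pd.DataFrame(data)

df['adj_days'] = df['days'] - 28                # days beyond 4 full weeks
df['adj_days_shift'] = df['adj_days'].shift(1)  # previous month's excess
df['cumsum'] = df['adj_days_shift'].cumsum()
df = df.fillna(0)

# bound the running total to 0..6: anything past 6 wraps around
df['adj_cumsum'] = df['cumsum'].mod(7).astype(int)
```

This reproduces the adj_cumsum column from the table above.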
I have a list of (year, month) pairs:
year_month = [(2020,8), (2021,1), (2021,6)]
and a dataframe df
| ID | Year | Month |
| 1 | 2020 | 1 |
| ... |
| 1 | 2020 | 12 |
| 1 | 2021 | 1 |
| ... |
| 1 | 2021 | 12 |
| 2 | 2020 | 1 |
| ... |
| 2 | 2020 | 12 |
| 2 | 2021 | 1 |
| ... |
| 2 | 2021 | 12 |
| 3 | 2021 | 1 |
| ... |
I want to select the rows whose Year and Month match one of the pairs in the year_month list:
Output df :
| ID | Year | Month |
| 1 | 2020 | 8 |
| 1 | 2021 | 1 |
| 1 | 2021 | 6 |
| 2 | 2020 | 8 |
| 2 | 2021 | 1 |
| 2 | 2021 | 6 |
| 3 | 2020 | 8 |
| ... |
Any idea how to automate this, so that I only have to change the year_month pairs?
I want to put many pairs in year_month, so I'd like to keep a list of pairs rather than spelling out every combination in the filter, like this:
df = df[((df['Year'] == 2020) & (df['Month'] == 8)) |
((df['Year'] == 2021) & (df['Month'] == 1)) | ((df['Year'] == 2021) & (df['Month'] == 6))]
You can use a list comprehension and filter your dataframe with your list of tuples as below:
year_month = [(2020,8), (2021,1), (2021,6)]
df[[i in year_month for i in zip(df.Year,df.Month)]]
Which gives only the paired values back:
ID Year Month
2 1 2021 1
6 2 2021 1
8 3 2021 1
One way using pandas.DataFrame.merge:
df.merge(pd.DataFrame(year_month, columns=["Year", "Month"]))
Output:
ID Year Month
0 1 2021 1
1 2 2021 1
2 3 2021 1
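Both answers can be run end to end; here is a minimal sketch of the zip approach, plus an isin-on-MultiIndex alternative. The small frame below is a made-up stand-in for the real data:

```python
import pandas as pd

# hypothetical stand-in for the real dataframe
df = pd.DataFrame({
    "ID":    [1, 1, 1, 2, 2, 2],
    "Year":  [2020, 2020, 2021, 2020, 2021, 2021],
    "Month": [7, 8, 1, 8, 1, 6],
})
year_month = [(2020, 8), (2021, 1), (2021, 6)]

# list-comprehension approach: test each row's (Year, Month) pair
out1 = df[[pair in year_month for pair in zip(df.Year, df.Month)]]

# vectorised equivalent: build a MultiIndex and use .isin with tuples
mask = pd.MultiIndex.from_arrays([df.Year, df.Month]).isin(year_month)
out2 = df[mask]
```

Both keep exactly the rows whose (Year, Month) pair appears in year_month.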
I'm trying to figure out how to use groupby/eq to compute the mean of a specific column. I have a df as seen below (original df).
I would like to group by 'Group' and 'players', keep only rows where class equals 1, and get the mean of 'score'.
Example:
Group = a, players = 2:
(16 + 13 + 19) / 3 = 16
+-------+---------+-------+-------+------------+
| Group | players | class | score | score_mean |
+-------+---------+-------+-------+------------+
| a | 2 | 2 | 14 | |
| a | 2 | 1 | 16 | 16 |
| a | 2 | 1 | 13 | 16 |
| a | 2 | 2 | 13 | |
| a | 2 | 1 | 19 | 16 |
| a | 2 | 2 | 17 | |
| a | 2 | 2 | 14 | |
+-------+---------+-------+-------+------------+
I've tried:
df['score_mean'] = df['class'].eq(1).groupby(['Group', 'players'])['score'].transform('mean')
but I kept getting a KeyError — df['class'].eq(1) returns a boolean Series, which has no 'Group' or 'players' columns to group by.
original df:
+----+-------+---------+-------+-------+
| | Group | players | class | score |
+----+-------+---------+-------+-------+
| 0 | a | 1 | 1 | 10 |
| 1 | c | 2 | 1 | 20 |
| 2 | a | 1 | 3 | 29 |
| 3 | c | 1 | 3 | 22 |
| 4 | a | 2 | 2 | 14 |
| 5 | b | 1 | 2 | 16 |
| 6 | a | 2 | 1 | 16 |
| 7 | b | 2 | 3 | 17 |
| 8 | c | 1 | 2 | 22 |
| 9 | b | 1 | 2 | 23 |
| 10 | c | 2 | 2 | 22 |
| 11 | d | 1 | 1 | 13 |
| 12 | a | 2 | 1 | 13 |
| 13 | d | 1 | 3 | 23 |
| 14 | a | 2 | 2 | 13 |
| 15 | d | 2 | 1 | 34 |
| 16 | b | 1 | 3 | 32 |
| 17 | c | 2 | 2 | 29 |
| 18 | b | 2 | 2 | 28 |
| 19 | a | 2 | 1 | 19 |
| 20 | a | 1 | 1 | 19 |
| 21 | c | 1 | 1 | 27 |
| 22 | b | 1 | 3 | 47 |
| 23 | a | 2 | 2 | 17 |
| 24 | c | 1 | 1 | 14 |
| 25 | c | 2 | 2 | 25 |
| 26 | a | 1 | 3 | 67 |
| 27 | b | 2 | 3 | 21 |
| 28 | a | 1 | 3 | 27 |
| 29 | c | 1 | 1 | 16 |
| 30 | a | 2 | 2 | 14 |
| 31 | b | 1 | 2 | 25 |
+----+-------+---------+-------+-------+
data = {'Group':['a','c','a','c','a','b','a','b','c','b','c','d','a','d','a','d',
'b','c','b','a','a','c','b','a','c','c','a','b','a','c','a','b'],
'players':[1,2,1,1,2,1,2,2,1,1,2,1,2,1,2,2,1,2,2,2,1,1,1,2,1,2,1,2,1,1,2,1],
'class':[1,1,3,3,2,2,1,3,2,2,2,1,1,3,2,1,3,2,2,1,1,1,3,2,1,2,3,3,3,1,2,2],
'score':[10,20,29,22,14,16,16,17,22,23,22,13,13,23,13,34,32,29,28,19,19,27,47,17,14,25,67,21,27,16,14,25]}
df = pd.DataFrame(data)
Kindly advise.
Many thanks & regards
Try this, via the set_index(), groupby(), assign() and reset_index() methods:
df=(df.set_index(['Group','players'])
.assign(score_mean=df[df['class'].eq(1)].groupby(['Group', 'players'])['score'].mean())
.reset_index())
Update:
If you want the first df as your output then use:
grouped=df.groupby(['Group', 'players','class']).transform('mean')
grouped=grouped.assign(players=df['players'],Group=df['Group'],Class=df['class']).where(df['Group']=='a').dropna()
grouped['score']=grouped.apply(lambda x:float('NaN') if x['players']==1 else x['score'],1)
grouped=grouped.dropna(subset=['score'])
Now if you print grouped, you will get your desired output.
If I got you right, you need values returned only where class = 1. I'm not sure what that will serve, but the code is below. Use groupby transform and chain where:
df['score_mean']=df.groupby(['Group','players'])['score'].transform('mean').where(df['class']==1).fillna('')
   Group  players  class  score          score_mean
0      a        1      1     10                30.4
1      c        2      1     20                24.0
2      a        1      3     29
3      c        1      3     22
4      a        2      2     14
5      b        1      2     16
6      a        2      1     16  15.142857142857142
7      b        2      3     17
8      c        1      2     22
9      b        1      2     23
10     c        2      2     22
11     d        1      1     13                18.0
12     a        2      1     13  15.142857142857142
13     d        1      3     23
14     a        2      2     13
15     d        2      1     34                34.0
16     b        1      3     32
17     c        2      2     29
18     b        2      2     28
19     a        2      1     19  15.142857142857142
20     a        1      1     19                30.4
21     c        1      1     27                20.2
22     b        1      3     47
23     a        2      2     17
24     c        1      1     14                20.2
25     c        2      2     25
26     a        1      3     67
27     b        2      3     21
28     a        1      3     27
29     c        1      1     16                20.2
30     a        2      2     14
31     b        1      2     25
You could first filter by class and then create score_mean by doing a groupby and transform.
(
df[df['class']==1]
.assign(score_mean = lambda x: x.groupby(['Group', 'players']).score.transform('mean'))
)
Group players class score score_mean
0 a 1 1 10 14.5
1 c 2 1 20 20.0
6 a 2 1 16 16.0
11 d 1 1 13 13.0
12 a 2 1 13 16.0
15 d 2 1 34 34.0
19 a 2 1 19 16.0
20 a 1 1 19 14.5
21 c 1 1 27 19.0
24 c 1 1 14 19.0
29 c 1 1 16 19.0
If you want to keep other classes and set the mean to '', you can do:
(
df[df['class']==1]
.groupby(['Group', 'players']).score.transform('mean')
.pipe(lambda x: df.assign(score_mean = x))
.fillna('')
)
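Another sketch, in case you want the exact layout from the question (the class-1 mean shown only on class-1 rows, blanks elsewhere): mask the scores first, then transform, so rows with other classes stay in the frame but are excluded from the mean.

```python
import pandas as pd

data = {'Group': ['a','c','a','c','a','b','a','b','c','b','c','d','a','d','a','d',
                  'b','c','b','a','a','c','b','a','c','c','a','b','a','c','a','b'],
        'players': [1,2,1,1,2,1,2,2,1,1,2,1,2,1,2,2,1,2,2,2,1,1,1,2,1,2,1,2,1,1,2,1],
        'class': [1,1,3,3,2,2,1,3,2,2,2,1,1,3,2,1,3,2,2,1,1,1,3,2,1,2,3,3,3,1,2,2],
        'score': [10,20,29,22,14,16,16,17,22,23,22,13,13,23,13,34,32,29,28,19,19,27,47,17,14,25,67,21,27,16,14,25]}
df = pd.DataFrame(data)

# NaN-out scores on non-class-1 rows; transform('mean') ignores NaN,
# so each group's mean is computed over class-1 rows only
masked = df['score'].where(df['class'].eq(1))
df['score_mean'] = (masked.groupby([df['Group'], df['players']])
                          .transform('mean')
                          .where(df['class'].eq(1), ''))
```

This gives, e.g., 16.0 on every class-1 row of (a, 2) and an empty cell on its class-2 rows, matching the first table in the question.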
I'm trying to use python to solve my data analysis problem.
I have a table like this:
+----------+-----+------+--------+-------------+--------------+
| ID | QTR | Year | MEF_ID | Qtr_Measure | Value_column |
+----------+-----+------+--------+-------------+--------------+
| 11 | 1 | 2020 | Name1 | QTRAVG | 5 |
| 11 | 2 | 2020 | Name1 | QTRAVG | 8 |
| 11 | 3 | 2020 | Name1 | QTRAVG | 6 |
| 11 | 4 | 2020 | Name1 | QTRAVG | 9 |
| 15 | 1 | 2020 | Name2 | QTRAVG | 67 |
| 15 | 2 | 2020 | Name2 | QTRAVG | 89 |
| 15 | 3 | 2020 | Name2 | QTRAVG | 100 |
| 15 | 4 | 2020 | Name2 | QTRAVG | 121 |
| 11 | 1 | 2020 | Name1 | QTRMAX | 6 |
| 11 | 2 | 2020 | Name1 | QTRMAX | 9 |
| 11 | 3 | 2020 | Name1 | QTRMAX | 7 |
| 11 | 4 | 2020 | Name1 | QTRMAX | 10 |
+----------+-----+------+--------+-------------+--------------+
I want to rearrange Value_column so that it captures the cases where there are multiple Qtr_Measure values for a unique ID and MEF_ID. This reduces the overall size of the table, and I would like the Qtr_Measure types to become columns, as below:
+----------+-----+------+--------+-------------+--------+--------+
| ID | QTR | Year | MEF_ID | Qtr_Measure | QTRAVG | QTRMAX |
+----------+-----+------+--------+-------------+--------+--------+
| 11 | 1 | 2020 | Name1 | QTRAVG | 5 | 6 |
| 11 | 2 | 2020 | Name1 | QTRAVG | 8 | 9 |
| 11 | 3 | 2020 | Name1 | QTRAVG | 6 | 7 |
| 11 | 4 | 2020 | Name1 | QTRAVG | 9 | 10 |
| 15 | 1 | 2020 | Name2 | QTRAVG | 67 | |
| 15 | 2 | 2020 | Name2 | QTRAVG | 89 | |
| 15 | 3 | 2020 | Name2 | QTRAVG | 100 | |
| 15 | 4 | 2020 | Name2 | QTRAVG | 121 | |
+----------+-----+------+--------+-------------+--------+--------+
How can I do this with python?
Thank you
Use pivot_table with reset_index and rename_axis:
piv = (df.pivot_table(index=['ID', 'QTR', 'Year', 'MEF_ID'],
values='Value_column',
columns='Qtr_Measure')
.reset_index()
.rename_axis(None, axis=1)
)
print(piv)
ID QTR Year MEF_ID QTRAVG QTRMAX
0 11 1 2020 Name1 5.0 6.0
1 11 2 2020 Name1 8.0 9.0
2 11 3 2020 Name1 6.0 7.0
3 11 4 2020 Name1 9.0 10.0
4 15 1 2020 Name2 67.0 NaN
5 15 2 2020 Name2 89.0 NaN
6 15 3 2020 Name2 100.0 NaN
7 15 4 2020 Name2 121.0 NaN
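For completeness, a self-contained version with the sample data rebuilt from the question's table, so the pivot can be run end to end:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [11]*4 + [15]*4 + [11]*4,
    "QTR": [1, 2, 3, 4] * 3,
    "Year": [2020] * 12,
    "MEF_ID": ["Name1"]*4 + ["Name2"]*4 + ["Name1"]*4,
    "Qtr_Measure": ["QTRAVG"]*8 + ["QTRMAX"]*4,
    "Value_column": [5, 8, 6, 9, 67, 89, 100, 121, 6, 9, 7, 10],
})

# spread Qtr_Measure into one column per measure type;
# combinations with no QTRMAX row come out as NaN
piv = (df.pivot_table(index=['ID', 'QTR', 'Year', 'MEF_ID'],
                      values='Value_column',
                      columns='Qtr_Measure')
         .reset_index()
         .rename_axis(None, axis=1))
```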
I have data that looks like this:
+------+---------+------+-------+
| Year | Cluster | AREA | COUNT |
+------+---------+------+-------+
| 2016 | 0 | 10 | 2952 |
| 2016 | 1 | 10 | 2556 |
| 2016 | 2 | 10 | 8867 |
| 2016 | 3 | 10 | 9786 |
| 2017 | 0 | 10 | 2470 |
| 2017 | 1 | 10 | 3729 |
| 2017 | 2 | 10 | 8825 |
| 2017 | 3 | 10 | 9114 |
| 2018 | 0 | 10 | 1313 |
| 2018 | 1 | 10 | 3564 |
| 2018 | 2 | 10 | 7245 |
| 2018 | 3 | 10 | 6990 |
+------+---------+------+-------+
I have to get the percentage changes for each cluster compared to the previous year, e.g.
+------+---------+-----------+-------+----------------+
| Year | Cluster | AREA | COUNT | Percent Change |
+------+---------+-----------+-------+----------------+
| 2016 | 0 | 10 | 2952 | NaN |
| 2017 | 0 | 10 | 2470 | -16.33% |
| 2018 | 0 | 10 | 1313 | -46.84% |
| 2016 | 1 | 10 | 2556 | NaN |
| 2017 | 1 | 10 | 3729 | 45.89% |
| 2018 | 1 | 10 | 3564 | -4.42% |
| 2016 | 2 | 10 | 8867 | NaN |
| 2017 | 2 | 10 | 8825 | -0.47% |
| 2018 | 2 | 10 | 7245 | -17.90% |
| 2016 | 3 | 10 | 9786 | NaN |
| 2017 | 3 | 10 | 9114 | -6.87% |
| 2018 | 3 | 10 | 6990 | -23.30% |
+------+---------+-----------+-------+----------------+
Is there an easy way to do this?
I've tried a few things; the one below seemed to make the most sense, but it returns NaN for every row — grouping by both 'Cluster' and 'Year' puts each row in its own group, so there is no previous value to compare against.
df['pct_change'] = df.groupby(['Cluster','Year'])['COUNT '].pct_change()
+------+---------+------+------------+------------+
| Year | Cluster | AREA | Count | pct_change |
+------+---------+------+------------+------------+
| 2016 | 0 | 10 | 295200.00% | NaN |
| 2016 | 1 | 10 | 255600.00% | NaN |
| 2016 | 2 | 10 | 886700.00% | NaN |
| 2016 | 3 | 10 | 978600.00% | NaN |
| 2017 | 0 | 10 | 247000.00% | NaN |
| 2017 | 1 | 10 | 372900.00% | NaN |
| 2017 | 2 | 10 | 882500.00% | NaN |
| 2017 | 3 | 10 | 911400.00% | NaN |
| 2018 | 0 | 10 | 131300.00% | NaN |
| 2018 | 1 | 10 | 356400.00% | NaN |
| 2018 | 2 | 10 | 724500.00% | NaN |
| 2018 | 3 | 10 | 699000.00% | NaN |
+------+---------+------+------------+------------+
Basically, I simply want to compare the year-on-year change for each cluster.
df['pct_change'] = df.groupby(['Cluster'])['Count'].pct_change()
df.sort_values('Cluster', axis = 0, ascending = True)
Another method, going old school with transform:
df['p'] = df.groupby('cluster')['count'].transform(lambda x: (x-x.shift())/x.shift())
df = df.sort_values(by='cluster')
print(df)
year cluster area count p
0 2016 0 10 2952 NaN
4 2017 0 10 2470 -0.163279
8 2018 0 10 1313 -0.468421
1 2016 1 10 2556 NaN
5 2017 1 10 3729 0.458920
9 2018 1 10 3564 -0.044248
2 2016 2 10 8867 NaN
6 2017 2 10 8825 -0.004737
10 2018 2 10 7245 -0.179037
3 2016 3 10 9786 NaN
7 2017 3 10 9114 -0.068670
11 2018 3 10 6990 -0.233048
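Both suggestions boil down to grouping by Cluster only. A runnable sketch with the question's data, including the percentage formatting from the desired output:

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [2016]*4 + [2017]*4 + [2018]*4,
    "Cluster": [0, 1, 2, 3] * 3,
    "AREA": [10] * 12,
    "COUNT": [2952, 2556, 8867, 9786,
              2470, 3729, 8825, 9114,
              1313, 3564, 7245, 6990],
})

# group by Cluster alone so pct_change compares successive years
df['pct_change'] = df.groupby('Cluster')['COUNT'].pct_change()
df = df.sort_values(['Cluster', 'Year'])

# optional: render as a percentage string like the desired output
df['Percent Change'] = df['pct_change'].map(
    lambda v: '' if pd.isna(v) else f'{v:.2%}')
```

The first year of each cluster has no predecessor, so it stays blank (NaN in pct_change).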