Pandas cumulative sum without changing week order number - python

I have a dataframe which looks like the following:
df:
RY Week no Value
2020 14 3.95321
2020 15 3.56425
2020 16 0.07042
2020 17 6.45417
2020 18 0.00029
2020 19 0.27737
2020 20 4.12644
2020 21 0.32753
2020 22 0.47239
2020 23 0.28756
2020 24 1.83029
2020 25 0.75385
2020 26 2.08981
2020 27 2.05611
2020 28 1.00614
2020 29 0.02105
2020 30 0.58101
2020 31 3.49083
2020 32 8.29013
2020 33 8.99825
2020 34 2.66293
2020 35 0.16448
2020 36 2.26301
2020 37 1.09302
2020 38 1.66566
2020 39 1.47233
2020 40 6.42708
2020 41 2.67947
2020 42 6.79551
2020 43 4.45881
2020 44 1.87972
2020 45 0.76284
2020 46 1.8671
2020 47 2.07159
2020 48 2.87303
2020 49 7.66944
2020 50 1.20421
2020 51 9.04416
2020 52 2.2625
2020 1 1.17026
2020 2 14.22263
2020 3 1.36464
2020 4 2.64862
2020 5 8.69916
2020 6 4.51259
2020 7 2.83411
2020 8 3.64183
2020 9 4.77292
2020 10 1.64729
2020 11 1.6878
2020 12 2.24874
2020 13 0.32712
I created the Week no column from a date. In my scenario the regulatory year starts on 1st April and ends on 31st March of the next year, which is why Week no starts at 14 and ends at 13. Now I want to create another column that contains the cumulative sum of the Value column. I tried cumsum() with the following code:
df['Cummulative Value'] = df.groupby('RY')['Value'].apply(lambda x:x.cumsum())
The problem with the above code is that it starts calculating the cumulative sum from week no 1, not from week no 14 onwards. Is there any way to calculate the cumulative sum without disturbing the week order number?

EDIT: You can sort the values by RY and Week no before GroupBy.cumsum, and then sort the index at the end to restore the original order:
# create a default index so sort_index can restore the original order
df = df.reset_index(drop=True)
df['Cummulative Value'] = df.sort_values(['RY','Week no']).groupby('RY')['Value'].cumsum().sort_index()
print (df)
RY Week no Value Cummulative Value
0 2020 14 3.95321 53.73092
1 2020 15 3.56425 57.29517
2 2020 16 0.07042 57.36559
3 2020 17 6.45417 63.81976
4 2020 18 0.00029 63.82005
5 2020 19 0.27737 64.09742
6 2020 20 4.12644 68.22386
7 2020 21 0.32753 68.55139
8 2020 22 0.47239 69.02378
9 2020 23 0.28756 69.31134
10 2020 24 1.83029 71.14163
11 2020 25 0.75385 71.89548
12 2020 26 2.08981 73.98529
13 2020 27 2.05611 76.04140
14 2020 28 1.00614 77.04754
15 2020 29 0.02105 77.06859
16 2020 30 0.58101 77.64960
17 2020 31 3.49083 81.14043
18 2020 32 8.29013 89.43056
19 2020 33 8.99825 98.42881
20 2020 34 2.66293 101.09174
21 2020 35 0.16448 101.25622
22 2020 36 2.26301 103.51923
23 2020 37 1.09302 104.61225
24 2020 38 1.66566 106.27791
25 2020 39 1.47233 107.75024
26 2020 40 6.42708 114.17732
27 2020 41 2.67947 116.85679
28 2020 42 6.79551 123.65230
29 2020 43 4.45881 128.11111
30 2020 44 1.87972 129.99083
31 2020 45 0.76284 130.75367
32 2020 46 1.86710 132.62077
33 2020 47 2.07159 134.69236
34 2020 48 2.87303 137.56539
35 2020 49 7.66944 145.23483
36 2020 50 1.20421 146.43904
37 2020 51 9.04416 155.48320
38 2020 52 2.26250 157.74570
39 2020 1 1.17026 1.17026
40 2020 2 14.22263 15.39289
41 2020 3 1.36464 16.75753
42 2020 4 2.64862 19.40615
43 2020 5 8.69916 28.10531
44 2020 6 4.51259 32.61790
45 2020 7 2.83411 35.45201
46 2020 8 3.64183 39.09384
47 2020 9 4.77292 43.86676
48 2020 10 1.64729 45.51405
49 2020 11 1.68780 47.20185
50 2020 12 2.24874 49.45059
51 2020 13 0.32712 49.77771
EDIT:
After some discussion, the solution can be simplified to a plain GroupBy.cumsum:
df['Cummulative Value'] = df.groupby('RY')['Value'].cumsum()
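A minimal sketch with made-up numbers (not from the question) showing why this works: groupby().cumsum() accumulates in the existing row order, so a frame already ordered by regulatory year needs no sorting:
import pandas as pd

# hypothetical mini-frame already in regulatory-year order
df = pd.DataFrame({'RY': [2020, 2020, 2020, 2020],
                   'Week no': [14, 15, 1, 2],
                   'Value': [1.0, 2.0, 3.0, 4.0]})
df['Cummulative Value'] = df.groupby('RY')['Value'].cumsum()
print(df)
#      RY  Week no  Value  Cummulative Value
# 0  2020       14    1.0                1.0
# 1  2020       15    2.0                3.0
# 2  2020        1    3.0                6.0
# 3  2020        2    4.0               10.0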

Related

Find sum of values of a column spread over different months in Python

I have a table that looks like
A B C
2017 9 65
2017 10 72
2017 11 88
2017 12 97
2018 1 85
2018 2 67
2018 3 76
2018 4 51
2018 5 69
2018 6 97
2018 7 101
2018 8 22
2019 1 56
2019 2 34
2019 3 71
2019 4 122
2019 5 167
2019 6 34
2019 7 17
2019 8 99
2019 9 20
2019 10 26
2019 11 39
2019 12 30
2020 1 56
2020 2 34
2020 3 71
2020 4 122
2020 5 167
2020 6 34
2020 7 17
2020 8 99
2020 9 20
2020 10 26
2020 11 39
2020 12 30
2021 1 56
2021 2 34
2021 3 71
2021 4 122
2021 5 167
2021 6 34
2021 7 17
2021 8 99
2021 9 20
2021 10 26
2021 11 39
2021 12 30
Now what I want is:
A B C D
2017 9 65 890
2017 10 72 890
2017 11 88 890
2017 12 97 890
2018 1 85 890
2018 2 67 890
2018 3 76 890
2018 4 51 890
2018 5 69 890
2018 6 97 890
2018 7 101 890
2018 8 22 890
2019 1 56 715
2019 2 34 715
2019 3 71 715
2019 4 122 715
2019 5 167 715
2019 6 34 715
2019 7 17 715
2019 8 99 715
2019 9 20 715
2019 10 26 715
2019 11 39 715
2019 12 30 715
2020 1 56 715
2020 2 34 715
2020 3 71 715
2020 4 122 715
2020 5 167 715
2020 6 34 715
2020 7 17 715
2020 8 99 715
2020 9 20 715
2020 10 26 715
2020 11 39 715
2020 12 30 715
2021 1 56 715
2021 2 34 715
2021 3 71 715
2021 4 122 715
2021 5 167 715
2021 6 34 715
2021 7 17 715
2021 8 99 715
2021 9 20 715
2021 10 26 715
2021 11 39 715
2021 12 30 715
Here 890 is the sum of all the values from 9,2017 through 8,2018, and 715 is the sum of all values from 1,2019 through 12,2019; similarly, 715 is the sum of all values from 1,2020 through 12,2020, and likewise for 1,2021 through 12,2021. For ease of calculation the numbers in column C have been kept the same, i.e. (56,34,71,122,167,34,17,99,20,26,39,30), for each of 2019, 2020 and 2021. These numbers may vary for each of the years, and so may their sums: we could have values like (67,87,99,100,76,11,23,44,56,78,87,5) for 2020 and (12,13,14,15,16,17,18,19,20,21,22,23) for 2021, for the months (1,2,3,4,5,6,7,8,9,10,11,12) respectively.
Now my efforts:
count_months_in_each_year = data.groupby('CALENDAR_YEAR').agg({'CALMONTH':'count'})
count_months_in_each_year.reset_index(inplace = True)
count_months_in_each_year.rename({'CALMONTH':'Count_of_Months'}, axis =1, inplace = True)
data = pd.merge(data, count_months_in_each_year, on = 'CALENDAR_YEAR', how = 'left', indicator = True )
data.drop(columns = ['_merge'], axis =1 , inplace = True)
Now how do I get the sum of the values, especially in the case where I have to consider 9,2017 through 8,2018, given that I have the count?
Based on this, what logic can be used to generalize the code and get the result?
I also tried this :
####Compute total number of records - number of records which have count of months < 12
number_ofless_than_12_records = data.shape[0] - data[data['Count_of_Months']==12].shape[0]
#number_ofless_than_12_records = 144.
#Total records = 576
Can we make use of this somehow?
I think what you are looking for is to form groups of 12 rows and transform with the group sum:
df['D'] = df.groupby(df.index // 12)['C'].transform('sum')
A B C D
0 2017 9 65 890
1 2017 10 72 890
2 2017 11 88 890
3 2017 12 97 890
4 2018 1 85 890
5 2018 2 67 890
6 2018 3 76 890
7 2018 4 51 890
8 2018 5 69 890
9 2018 6 97 890
10 2018 7 101 890
11 2018 8 22 890
12 2019 1 56 715
13 2019 2 34 715
14 2019 3 71 715
15 2019 4 122 715
16 2019 5 167 715
17 2019 6 34 715
18 2019 7 17 715
19 2019 8 99 715
20 2019 9 20 715
21 2019 10 26 715
22 2019 11 39 715
23 2019 12 30 715
24 2020 1 56 715
25 2020 2 34 715
26 2020 3 71 715
27 2020 4 122 715
28 2020 5 167 715
29 2020 6 34 715
30 2020 7 17 715
31 2020 8 99 715
32 2020 9 20 715
33 2020 10 26 715
34 2020 11 39 715
35 2020 12 30 715
36 2021 1 56 715
37 2021 2 34 715
38 2021 3 71 715
39 2021 4 122 715
40 2021 5 167 715
41 2021 6 34 715
42 2021 7 17 715
43 2021 8 99 715
44 2021 9 20 715
45 2021 10 26 715
46 2021 11 39 715
47 2021 12 30 715
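A minimal runnable sketch of the same idea with hypothetical data; note the integer division only lines up with 12-row blocks when the frame has a default RangeIndex, hence the reset_index:
import pandas as pd

# hypothetical frame: one 12-row block spanning 9/2017-8/2018, then calendar-year 2019
df = pd.DataFrame({'A': [2017] * 4 + [2018] * 8 + [2019] * 12,
                   'B': [9, 10, 11, 12] + list(range(1, 9)) + list(range(1, 13)),
                   'C': range(24)})
df = df.reset_index(drop=True)                       # ensure a default 0..n-1 index
df['D'] = df.groupby(df.index // 12)['C'].transform('sum')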
You can use pandas rolling window function https://pandas.pydata.org/docs/user_guide/window.html
df['D'] = df['C'].rolling(window=12).sum()
This will calculate the sum of the current month and 11 rows back. But it will fill with NaN values in the beginning, until there are enough months to look back.
So we can shift up 11 rows to get the wanted result.
df['D'] = df['D'].shift(-11)
And if you don't want any NaNs at the end, you can interpolate or pad them out.
df['D'] = df['D'].interpolate()
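For reference, the three steps chained together (a sketch under the same 12-row-block assumption):
df['D'] = (df['C'].rolling(window=12).sum()   # NaN for the first 11 rows
                  .shift(-11)                 # align each sum with the start of its window
                  .interpolate())             # fill the trailing NaNs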

Fill column with value from previous year from the same month

How can I use the value from the same month in the previous year to fill values in the following table for 2020:
Category Month Year Value
A 1 2019 15
B 2 2019 20
A 2 2019 90
A 3 2019 50
B 4 2019 40
A 5 2019 20
A 6 2019 15
A 7 2019 17
A 8 2019 18
A 9 2019 12
A 10 2019 11
A 11 2019 19
A 12 2019 15
A 1 2020 18
A 2 2020 53
A 3 2020 80
The final desired result is the following:
Category Month Year Value
A 1 2019 15
B 2 2019 20
A 2 2019 90
A 3 2019 50
B 4 2019 40
A 4 2019 40
A 5 2019 20
A 6 2019 15
A 7 2019 17
A 8 2019 18
A 9 2019 12
A 10 2019 11
A 11 2019 19
A 12 2019 15
A 1 2020 18
A 2 2020 53
A 3 2020 80
B 4 2020 40
A 4 2020 40
A 5 2020 20
A 6 2020 15
A 7 2020 17
A 8 2020 18
A 9 2020 12
A 10 2020 11
A 11 2020 19
A 12 2020 15
I tried using pandas groupby but not sure if that is the right approach.
IIUC, we can use pivot_table, then ffill within each Category, then stack:
s=df.pivot_table(index=['Category','Year'],columns='Month',values='Value').groupby(level=0).ffill().stack().reset_index()
Category Year level_2 0
0 A 2019 1 15.0
1 A 2019 2 90.0
2 A 2019 3 50.0
3 A 2019 5 20.0
4 A 2019 6 15.0
5 A 2019 7 17.0
6 A 2019 8 18.0
7 A 2019 9 12.0
8 A 2019 10 11.0
9 A 2019 11 19.0
10 A 2019 12 15.0
11 A 2020 1 18.0
12 A 2020 2 53.0
13 A 2020 3 80.0
14 A 2020 5 20.0
15 A 2020 6 15.0
16 A 2020 7 17.0
17 A 2020 8 18.0
18 A 2020 9 12.0
19 A 2020 10 11.0
20 A 2020 11 19.0
21 A 2020 12 15.0
22 B 2019 2 20.0
23 B 2019 4 40.0
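The stacked frame still carries the default labels level_2 and 0; a small hedged finishing step, assuming you want the original column names back:
s.columns = ['Category', 'Year', 'Month', 'Value']
s['Value'] = s['Value'].astype(int)   # pivot_table promoted the values to float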
You can accomplish this with a combination of loc, concat, and drop_duplicates.
The idea here is to concatenate the dataframe with a copy of the 2019 data where the year is changed to 2020, and then keep only the first value for each Category, Month, Year.
df2 = df.loc[df['Year'] == 2019, :].copy()  # .copy() avoids a SettingWithCopyWarning
df2['Year'] = 2020
pd.concat([df, df2]).drop_duplicates(subset=['Category', 'Month', 'Year'], keep='first')
Output
Category Month Year Value
0 A 1 2019 15
1 B 2 2019 20
2 A 2 2019 90
3 A 3 2019 50
4 B 4 2019 40
5 A 5 2019 20
6 A 6 2019 15
7 A 7 2019 17
8 A 8 2019 18
9 A 9 2019 12
10 A 10 2019 11
11 A 11 2019 19
12 A 12 2019 15
13 A 1 2020 18
14 A 2 2020 53
15 A 3 2020 80
1 B 2 2020 20
4 B 4 2020 40
5 A 5 2020 20
6 A 6 2020 15
7 A 7 2020 17
8 A 8 2020 18
9 A 9 2020 12
10 A 10 2020 11
11 A 11 2020 19
12 A 12 2020 15
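If you also want the rows in chronological order with a fresh index (an assumption about the desired presentation), a small follow-up:
result = (pd.concat([df, df2])
            .drop_duplicates(subset=['Category', 'Month', 'Year'], keep='first')
            .sort_values(['Year', 'Month'])
            .reset_index(drop=True))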

Python dataframe group by column and create new column with percentage [duplicate]

This question already has answers here:
Pandas percentage of total with groupby
(16 answers)
Closed 5 years ago.
I have a scenario, simulated in a dataframe which looks something like the below:
Month Amount
1 Jan 260
2 Feb 179
3 Mar 153
4 Apr 142
5 May 128
6 Jun 116
7 Jul 71
8 Aug 56
9 Sep 49
10 Oct 17
11 Nov 0
12 Dec 0
I'm trying to get a new column by calculating the percentage for each row, using a dataframe groupby and a lambda function as below:
df = pd.DataFrame(mylistofdict)
df = df.groupby('Month')["Amount"].apply(lambda x: x / x.sum()*100)
But I'm not getting the expected result below, which has only 2 columns:
Month Percentage
1 Jan 22%
2 Feb 15%
3 Mar 13%
4 Apr 12%
5 May 11%
6 Jun 10%
7 Jul 6%
8 Aug 5%
9 Sep 4%
10 Oct 1%
11 Nov 0
12 Dec 0
How do I modify my code, or is there anything better than using a dataframe?
If values of Month are unique use:
df['perc'] = df["Amount"] / df["Amount"].sum() * 100
print (df)
Month Amount perc
1 Jan 260 22.203245
2 Feb 179 15.286080
3 Mar 153 13.065756
4 Apr 142 12.126388
5 May 128 10.930828
6 Jun 116 9.906063
7 Jul 71 6.063194
8 Aug 56 4.782237
9 Sep 49 4.184458
10 Oct 17 1.451751
11 Nov 0 0.000000
12 Dec 0 0.000000
If values of Month are duplicated, I believe it is possible to use:
print (df)
Month Amount
1 Jan 260
1 Jan 100
3 Mar 153
4 Apr 142
5 May 128
6 Jun 116
7 Jul 71
8 Aug 56
9 Sep 49
10 Oct 17
11 Nov 0
12 Dec 0
df = df.groupby('Month', as_index=False, sort=False)["Amount"].sum()
df['perc'] = df["Amount"] / df["Amount"].sum() * 100
print (df)
Month Amount perc
0 Jan 360 32.967033
1 Mar 153 14.010989
2 Apr 142 13.003663
3 May 128 11.721612
4 Jun 116 10.622711
5 Jul 71 6.501832
6 Aug 56 5.128205
7 Sep 49 4.487179
8 Oct 17 1.556777
9 Nov 0 0.000000
10 Dec 0 0.000000

Melt or Stack groups of columns on python pandas

I have a pandas DataFrame like this
year id1 id2 jan jan1 jan2 feb feb1 feb2 mar mar1 mar2 ....
2018 01 10 3 30 31 2 23 25 7 52 53 ....
2018 01 20 ....
2018 02 10 ....
2018 02 20 ....
and I need this format
year month id1 id2 val val1 val2
2018 01 01 10 3 30 31
2018 02 01 10 2 23 25
2018 03 01 10 7 52 53
..........
As you can see, I have 3 values for each month, and I want to add just one column for the month, with 3 columns for the values. If it were only one value column, I think I could use stack.
I wouldn't have any problem renaming the month columns to 01 01-1 01-2 (for January) or something like that to make it easier.
I'm also thinking of separating the info into 3 different DataFrames to stack them separately and then merge the results; or should I melt it?
Any ideas for achieving this easily?
Using reshape and stack:
pd.DataFrame(df.set_index(['year','id1','id2']).values.reshape(4,3,3).tolist(),
             index=df.set_index(['year','id1','id2']).index,
             columns=[1,2,3])\
  .stack().apply(pd.Series).reset_index().rename(columns={'level_3':'month'})
Out[261]:
year id1 id2 month 0 1 2
0 2018 1 10 1 3 30 31
1 2018 1 10 2 2 23 25
2 2018 1 10 3 7 52 53
3 2018 1 20 1 3 30 31
4 2018 1 20 2 2 23 25
5 2018 1 20 3 7 52 53
6 2018 2 10 1 3 30 31
7 2018 2 10 2 2 23 25
8 2018 2 10 3 7 52 53
9 2018 2 20 1 3 30 31
10 2018 2 20 2 2 23 25
11 2018 2 20 3 7 52 53
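The hard-coded reshape(4, 3, 3) ties this to the 4-row example; a hedged generalization that derives the shape from the frame itself (still assuming 3 value columns per month):
vals = df.set_index(['year', 'id1', 'id2'])
n_months = vals.shape[1] // 3                 # jan/jan1/jan2, feb/feb1/feb2, ...
out = (pd.DataFrame(vals.values.reshape(len(vals), n_months, 3).tolist(),
                    index=vals.index,
                    columns=range(1, n_months + 1))
         .stack()
         .apply(pd.Series)
         .reset_index()
         .rename(columns={'level_3': 'month', 0: 'val', 1: 'val1', 2: 'val2'}))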
So I renamed the header columns this way
01 01 01 02 02 02 03 03 03 ...
year id1 id2 val val1 val2 val val1 val2 val val1 val2 ....
2018 01 10 3 30 31 2 23 25 7 52 53 ....
2018 01 20 ....
2018 02 10 ....
2018 02 20 ....
on a file, and opened it this way
df = pd.read_csv('my_file.csv',header=[0, 1], index_col=[0,1,2], skipinitialspace=True, tupleize_cols=True)
df.columns = pd.MultiIndex.from_tuples(df.columns)
then, I actually only needed to stack it on level 0
df = df.stack(level=0)
and add the titles
df.index.names = ['year','id1','id2','month']
df = df.reset_index()
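On recent pandas versions the tupleize_cols argument has been removed (a version assumption on my part); header=[0, 1] alone already builds the column MultiIndex, so the same approach reduces to:
df = pd.read_csv('my_file.csv', header=[0, 1], index_col=[0, 1, 2],
                 skipinitialspace=True)
df = df.stack(level=0)                 # stack the month level of the columns
df.index.names = ['year', 'id1', 'id2', 'month']
df = df.reset_index()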

query a csv using pandas

I have a csv file like this:
year month Company A Company B Company C
1990 Jan 10 15 20
1990 Feb 11 14 21
1990 mar 13 8 23
1990 april 12 22 19
1990 may 15 12 18
1990 june 18 13 13
1990 june 12 14 15
1990 july 12 14 16
1991 Jan 11 16 13
1991 Feb 14 17 11
1991 mar 23 13 12
1991 april 23 21 10
1991 may 22 22 9
1991 june 24 20 32
1991 june 12 14 15
1991 july 21 14 16
1992 Jan 10 13 26
1992 Feb 9 11 19
1992 mar 23 12 18
1992 april 12 10 21
1992 may 17 9 10
1992 june 15 42 9
1992 june 16 9 26
1992 july 15 26 19
1993 Jan 18 19 20
1993 Feb 19 18 21
1993 mar 20 21 23
1993 april 21 10 19
1993 may 13 9 14
1993 june 14 23 23
1993 june 15 21 23
1993 july 16 10 22
I want to find out, for each company, the month and year in which they had the highest number of sales; for example, for company A in year 1990 the highest sale was 18. I want to do this using pandas, but I don't understand how to proceed. Pointers needed, please.
PS: here is what I have done till now.
import pandas as pd
df = pd.read_csv('SAMPLE.csv')
num_of_rows = len(df.index)
years_list = []
months_list = []
company_list = df.columns[2:]
for each in df.columns[2:]:
    each = []
for i in range(0, num_of_rows):
    years_list.append(df[df.columns[0]][i])
    months_list.append(df[df.columns[1]][i])
years_list = list(set(years_list))
months_list = list(set(months_list))
for each in years_list:
    for c in company_list:
        print(df[(df.year == each)][c].max())
I am getting the biggest number for a year for a company, but I don't know how to also get the month and year.
Use a combination of idxmax() and loc to filter the dataframe:
In [36]:
import pandas as pd
import io
temp = """year month Company_A Company_B Company_C
1990 Jan 10 15 20
1990 Feb 11 14 21
1990 mar 13 8 23
1990 april 12 22 19
1990 may 15 12 18
1990 june 18 13 13
1990 june 12 14 15
1990 july 12 14 16
1991 Jan 11 16 13
1991 Feb 14 17 11
1991 mar 23 13 12
1991 april 23 21 10
1991 may 22 22 9
1991 june 24 20 32
1991 june 12 14 15
1991 july 21 14 16
1992 Jan 10 13 26
1992 Feb 9 11 19
1992 mar 23 12 18
1992 april 12 10 21
1992 may 17 9 10
1992 june 15 42 9
1992 june 16 9 26
1992 july 15 26 19
1993 Jan 18 19 20
1993 Feb 19 18 21
1993 mar 20 21 23
1993 april 21 10 19
1993 may 13 9 14
1993 june 14 23 23
1993 june 15 21 23
1993 july 16 10 22"""
df = pd.read_csv(io.StringIO(temp),sep='\s+')
# the call to .unique() is because the same row for A and C appears twice
df.loc[df[['Company_A', 'Company_B', 'Company_C']].idxmax().unique()]
Out[36]:
year month Company_A Company_B Company_C
13 1991 june 24 20 32
21 1992 june 15 42 9
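Since the question also asks about the best month within each year (e.g. company A in 1990), a hedged per-year variant of the same idxmax idea:
cols = ['Company_A', 'Company_B', 'Company_C']
best_rows = df.groupby('year')[cols].idxmax()    # row label of each year's max, per company
for company in cols:
    print(df.loc[best_rows[company], ['year', 'month', company]])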
