Hi everyone, I want to calculate the sum of the Violent_type count for each year. For example, the total count of Violent_type for year 2013 is 18728 + 121662 + 1035. But I don't know how to select the data when there is a MultiIndex. Any advice would be appreciated. Thanks.
The level argument in pandas.DataFrame.groupby() is what you are looking for.
level int, level name, or sequence of such, default None
If the axis is a MultiIndex (hierarchical), group by a particular level or levels.
To answer your question, you only need:
df.groupby(level=[0, 1]).sum()
# or
df.groupby(level=['district', 'year']).sum()
To see the effect:
import pandas as pd
iterables = [['001', 'SST'], [2013, 2014], ['Dangerous', 'Non-Violent', 'Violent']]
index = pd.MultiIndex.from_product(iterables, names=['district', 'year', 'Violent_type'])
df = pd.DataFrame(list(range(0, len(index))), index=index, columns=['count'])
'''
print(df)
                            count
district year Violent_type
001      2013 Dangerous         0
              Non-Violent       1
              Violent           2
         2014 Dangerous         3
              Non-Violent       4
              Violent           5
SST      2013 Dangerous         6
              Non-Violent       7
              Violent           8
         2014 Dangerous         9
              Non-Violent      10
              Violent          11
'''
print(df.groupby(level=[0, 1]).sum())
'''
               count
district year
001      2013      3
         2014     12
SST      2013     21
         2014     30
'''
print(df.groupby(level=['district', 'year']).sum())
'''
               count
district year
001      2013      3
         2014     12
SST      2013     21
         2014     30
'''
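As an aside, since the question also mentions not knowing how to select data when there is a MultiIndex: .xs and .loc work on the index levels. A small sketch using the same df as above:
# All rows for year 2013, across every district
print(df.xs(2013, level='year'))
# Only district '001' in 2013 (the remaining level, Violent_type, stays in the index)
print(df.loc[('001', 2013)])
# Sum of the counts for that district/year slice
print(df.loc[('001', 2013), 'count'].sum())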
I have a data frame that contains daily data for the last five years. Besides the value columns, the data frame also contains a date field and a regulatory year column. I wanted to create two columns: the regulatory week number and the regulatory month number. The regulatory year starts on the 1st of April and ends on the 31st of March. So I used the following code to generate the regulatory week number and month number:
import numpy as np
import pandas as pd

df['Week'] = np.where(df['date'].dt.isocalendar().week > 13, df['date'].dt.isocalendar().week - 13, df['date'].dt.isocalendar().week + 39)
df['month'] = df['date'].dt.strftime('%b')  # abbreviated month name, e.g. 'Apr'
months = ['Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec', 'Jan', 'Feb', 'Mar']
df['month'] = pd.Categorical(df['month'], ordered=True, categories=months)
df['month number'] = df['month'].apply(lambda x: months.index(x) + 1)
After creating the two columns mentioned above, my data frame looks as follows:
RY month Week Value 1 Value 2 Value 3 Value 4 month number
2016 Apr 1 0.00000 0.00000 0.000000 0.00000 1
2016 Apr 2 1.31394 0.02961 1.313940 0.02961 1
2016 Apr 3 4.98354 0.07146 4.983540 0.07146 1
2016 Apr 4 4.30606 0.05742 4.306060 0.05742 1
2016 Apr 5 1.94634 0.01958 1.946340 0.01958 1
2016 May 5 0.25342 0.01625 0.253420 0.01625 2
2016 May 6 0.64051 0.00777 0.640510 0.00777 2
2016 May 7 1.26451 0.02994 1.264510 0.02994 2
2016 May 8 2.71035 0.08150 2.194947 0.08150 2
2016 May 9 11.95120 0.13386 1.624328 0.13386 2
2016 Jun 10 6.93051 0.08126 6.930510 0.08126 3
2016 Jun 11 1.18872 0.03953 1.188720 0.03953 3
2016 Jun 12 3.19961 0.05760 0.924562 0.05760 3
2016 Jun 13 3.90429 0.04985 0.956445 0.04985 3
2016 Jun 14 0.84002 0.01738 0.840020 0.01738 3
2016 Jul 14 0.07358 0.00562 0.073580 0.00562 4
2016 Jul 15 0.78253 0.03014 0.782530 0.03014 4
2016 Jul 16 1.23036 0.01816 1.230360 0.01816 4
2016 Jul 17 0.62948 0.01341 0.629480 0.01341 4
2016 Jul 18 0.45513 0.00552 0.455130 0.00552 4
Now I want to create a data frame that contains the mean of the value columns grouped by Week. So I used the following command to calculate the mean:
mean_df = df.groupby('Week')[['Value1', 'Value2', 'Value3', 'Value4']].mean().reset_index()
The new dataframe looks as follows:
Week Value 1 Value 2 Value 3 Value 4
1 3.013490 0.039740 1.348016 0.039740
2 3.094456 0.045142 3.094456 0.045142
3 1.615948 0.027216 1.615948 0.027216
4 2.889245 0.043998 1.903319 0.043998
5 0.431549 0.009679 0.431549 0.009679
6 1.045670 0.017302 1.045670 0.017302
7 2.444196 0.034304 2.444196 0.034304
8 1.041210 0.026464 0.938129 0.026464
9 2.068607 0.030550 0.921176 0.030550
10 2.400118 0.051476 2.400118 0.051476
11 1.738332 0.035362 1.738332 0.035362
12 1.369790 0.038576 0.914780 0.038576
13 1.921781 0.021218 0.749460 0.021218
14 1.471432 0.027367 1.471432 0.027367
15 2.722526 0.053794 1.676559 0.053794
16 3.132406 0.043520 1.195321 0.043520
17 0.733952 0.021142 0.733952 0.021142
18 0.645236 0.014454 0.645236 0.014454
19 2.466326 0.049704 0.879481 0.049704
20 2.111326 0.013262 0.682253 0.013262
21 1.301004 0.023048 1.301004 0.023048
22 0.705360 0.023439 0.705360 0.023439
23 1.323438 0.019103 1.323438 0.019103
24 0.569906 0.012540 0.569906 0.012540
25 7.898792 0.034246 1.382349 0.034246
26 0.896413 0.013013 0.896413 0.013013
27 4.478349 0.039749 1.703887 0.039749
28 5.807160 0.052526 2.036502 0.052526
29 3.308176 0.043984 2.117939 0.043984
30 1.991078 0.046058 1.991078 0.046058
31 0.806589 0.016945 0.806589 0.016945
32 2.091860 0.029234 2.091860 0.029234
33 1.149280 0.025194 1.149280 0.025194
34 4.746376 0.067742 2.863484 0.067742
35 5.128558 0.029608 1.537541 0.029608
36 2.765563 0.052125 2.765563 0.052125
37 2.314376 0.036046 2.314376 0.036046
38 2.552290 0.030626 1.483397 0.030626
39 1.456778 0.037448 1.456778 0.037448
40 1.212090 0.024698 1.212090 0.024698
41 4.729104 0.037646 1.296358 0.037646
42 3.412830 0.053132 3.412830 0.053132
43 8.916526 0.050044 1.839411 0.050044
44 2.450281 0.029806 0.942205 0.029806
45 2.156186 0.024064 2.156186 0.024064
46 2.336330 0.042538 2.336330 0.042538
47 1.798326 0.025270 1.798326 0.025270
48 1.352004 0.018382 1.352004 0.018382
49 10.220510 0.073480 1.607830 0.073480
50 2.575344 0.047760 2.575344 0.047760
51 1.226056 0.028676 1.226056 0.028676
52 0.470392 0.009991 0.466561 0.009991
Now I want to insert the month name and month number from the first data frame into the new data frame. I thought of merging the two data frames on 'Week', but I found that the same week number is assigned to two different months in the first data frame. For example, week 5 is assigned to both April and May.
Ideally, a week number should be assigned to only one month. I am not sure whether I am calculating the week number in the right manner. Has anyone come across the same problem? Any advice on how to calculate the week number so that a week does not overlap two months would be appreciated.
Presumably, week 5 contains some days in April and some in May. So it's not possible to assign week 5 (as a whole) to a single month.
Perhaps you could assign the month in which the first day of the week falls?
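Here is a minimal sketch of that idea, assuming the same df with a datetime 'date' column and the 'Week' column from the question (the Value1-Value4 names are taken from the groupby call above):
import pandas as pd
# Label each row with the month in which the first day (Monday) of its ISO week falls,
# so every week number maps to exactly one month.
week_start = df['date'] - pd.to_timedelta(df['date'].dt.weekday, unit='D')
df['week_month'] = week_start.dt.strftime('%b')
# Carry the single month label along when averaging by week.
mean_df = (df.groupby('Week')
             .agg({'Value1': 'mean', 'Value2': 'mean',
                   'Value3': 'mean', 'Value4': 'mean',
                   'week_month': 'first'})
             .reset_index())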
Below is the basic data I'm provided with every month. There are many department-related files I get, and the job gets very monotonous and repetitive.
Month,year,sales,
January,2017,34400,
February,2017,35530,
March,2017,34920,
April,2017,35950,
May,2017,36230,
June,2017,36820,
July,2017,34590,
August,2017,36500,
September,2017,36600,
October,2017,37140,
November,2017,36790,
December,2017,43500,
January,2018,34900,
February,2018,37700,
March,2018,37900,
April,2018,38100,
May,2018,37800,
June,2018,38500,
July,2018,39400,
August,2018,39700,
September,2018,39980,
October,2018,40600,
November,2018,39100,
December,2018,46600,
January,2019,42500,
I've tried functions like value_counts() (which sadly only gives a summary) to achieve this output, and failed. (See the desired output below.)
I need to autofill the 3rd and 4th columns (I attempted this with fillna=True/False).
The third column just tells whether the month is a profit or a loss compared to the previous month (e.g. if April is greater than March, then it is Profit).
The fourth column shows the current run of P/L, i.e. 2 months or 5 months of profit (or loss) in a row. (I mean continuously, as it results in certain awards/recognition for teams.)
The fifth column is the maximum sales achieved in the last 'n' months.
They only allow Apache OpenOffice for our job, hence no Excel, but IT has given us permission to install Python.
The solution in this link does not help me, as it groups by two columns. The columns in my output are interdependent.
import pandas as pd

df = pd.read_csv("Test_1.csv")
df['comparative_position'] = df['sales'].diff()  # difference from the previous month
df.loc[df['comparative_position'] > 0.0, 'comparative_position'] = "Profit"
df.loc[df['comparative_position'] < 0.0, 'comparative_position'] = "Loss"
Month,Year,Sales,comparative_position,Months_in_P(or)L,Highest_in_12Months
January,2016,34400,NaN,NaN,NaN
February,2016,35530,Profit,1,NaN
March,2016,34920,Loss,1,NaN
April,2016,35950,Profit,1,NaN
May,2016,36230,Profit,2,NaN
June,2016,36820,Profit,3,NaN
July,2016,34590,Loss,1,NaN
August,2016,36500,Profit,1,NaN
September,2016,36600,Profit,2,NaN
October,2016,37140,Profit,3,NaN
November,2016,36790,Loss,1,NaN
December,2016,43500,Profit,1,43500
January,2017,34900,Loss,1,43500
February,2017,37700,Profit,1,43500
March,2017,37900,Profit,2,43500
April,2017,38100,Profit,3,43500
May,2017,37800,Loss,1,43500
June,2017,38500,Profit,1,43500
July,2017,39400,Profit,2,43500
August,2017,39700,Profit,3,43500
September,2017,39980,Profit,4,43500
October,2017,40600,Profit,5,43500
November,2017,39100,Loss,1,43500
December,2017,46600,Profit,1,46600
January,2018,42500,Loss,1,46600
AFAIU this should work for you:
# Get difference from previous as True / False
df['P/L'] = df.sales > df.sales.shift()
# Add column counting 'streaks' of P or L
df['streak'] = df['P/L'].groupby(df['P/L'].ne(df['P/L'].shift()).cumsum()).cumcount()
# map True/False to string of Profit/Loss
df['P/L'] = df['P/L'].map({True:'Profit', False:'Loss'})
# max of last n months where n is 12, as in your example, you can change it to any int
df['12_max'] = df.sales.rolling(12).max()
Output:
Month year sales P/L streak 12_max
0 January 2017 34400 False 0 NaN
1 February 2017 35530 True 0 NaN
2 March 2017 34920 False 0 NaN
3 April 2017 35950 True 0 NaN
4 May 2017 36230 True 1 NaN
5 June 2017 36820 True 2 NaN
6 July 2017 34590 False 0 NaN
7 August 2017 36500 True 0 NaN
8 September 2017 36600 True 1 NaN
9 October 2017 37140 True 2 NaN
10 November 2017 36790 False 0 NaN
11 December 2017 43500 True 0 43500.0
12 January 2018 34900 False 0 43500.0
13 February 2018 37700 True 0 43500.0
14 March 2018 37900 True 1 43500.0
15 April 2018 38100 True 2 43500.0
16 May 2018 37800 False 0 43500.0
17 June 2018 38500 True 0 43500.0
18 July 2018 39400 True 1 43500.0
19 August 2018 39700 True 2 43500.0
20 September 2018 39980 True 3 43500.0
21 October 2018 40600 True 4 43500.0
22 November 2018 39100 False 0 43500.0
23 December 2018 46600 True 0 46600.0
24 January 2019 42500 False 0 46600.0
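If you want to match the requested columns more closely (a streak that starts at 1 and the column names from your desired output), a small variation on the above, assuming the same df, could be:
# Count the streak from 1 within each run of consecutive Profit/Loss months
df['Months_in_P(or)L'] = (df['P/L']
                          .groupby(df['P/L'].ne(df['P/L'].shift()).cumsum())
                          .cumcount() + 1)
# Rolling 12-month maximum of sales (NaN until 12 months are available)
df['Highest_in_12Months'] = df['sales'].rolling(12).max()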
I'm trying to create new dataframe columns from the contents of an existing column. It's easier to explain with an example. I would like to convert this:
. Yr Month Class Cost
1 2015 1 L 19.2361
2 2015 1 M 29.4723
3 2015 1 S 48.5980
4 2015 1 T 169.7630
5 2015 2 L 19.1506
6 2015 2 M 30.0886
7 2015 2 S 49.3765
8 2015 2 T 167.0000
9 2015 3 L 19.3465
10 2015 3 M 29.1991
11 2015 3 S 46.2580
12 2015 3 T 157.7916
13 2015 4 L 18.3165
14 2015 4 M 28.2314
15 2015 4 S 44.5844
16 2015 4 T 162.3241
17 2015 5 L 17.4556
18 2015 5 M 27.0434
19 2015 5 S 42.8841
20 2015 5 T 159.3457
21 2015 6 L 16.5343
22 2015 6 M 24.9853
23 2015 6 S 40.5612
24 2015 6 T 153.4902
...into the following so that I can plot 4 separate lines [L, M, S, T]:
. Yr Month L M S T
1 2015 1 19.2361 29.4723 48.5980 169.7630
2 2015 2 19.1506 30.0886 49.3765 167.0000
3 2015 3 19.3465 29.1991 46.2580 157.7916
4 2015 4 18.3165 28.2314 44.5844 162.3241
5 2015 5 17.4556 27.0434 42.8841 159.3457
6 2015 6 16.5343 24.9853 40.5612 153.4902
I was able to work through it in what feels like a very clumsy way: filtering the dataframe on the 'Class' column and then doing 3 separate merges.
import pandas as pd

list_class = ['L', 'M', 'S', 'T']
year = 'Yr'
month = 'Month'
df_class = pd.DataFrame()
df_class1 = pd.DataFrame()
df_class2 = pd.DataFrame()
df_class1 = pd.merge(df[[month, year, 'Class', 'Cost']][df['Class'] == list_class[0]],
                     df[[month, year, 'Class', 'Cost']][df['Class'] == list_class[1]],
                     left_on=[month, year], right_on=[month, year])
df_class2 = pd.merge(df[[month, year, 'Class', 'Cost']][df['Class'] == list_class[2]],
                     df[[month, year, 'Class', 'Cost']][df['Class'] == list_class[3]],
                     left_on=[month, year], right_on=[month, year])
df_class = pd.merge(df_class1, df_class2, left_on=[month, year],
                    right_on=[month, year]).groupby([year, month]).mean().plot(figsize=(15, 8))
There must be a more efficient way. Feels like it should be done with groupby, but I couldn't nail it down.
You can first give df a MultiIndex and then unstack the Class level, which will give you what you want. Suppose df is the original dataframe shown at the very beginning of your post.
df.set_index(['Yr', 'Month', 'Class'])['Cost'].unstack('Class')
Out[29]:
Class            L        M        S         T
Yr   Month
2015 1     19.2361  29.4723  48.5980  169.7630
     2     19.1506  30.0886  49.3765  167.0000
     3     19.3465  29.1991  46.2580  157.7916
     4     18.3165  28.2314  44.5844  162.3241
     5     17.4556  27.0434  42.8841  159.3457
     6     16.5343  24.9853  40.5612  153.4902
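To get the four separate lines the question asks for, you can plot the unstacked frame directly. A minimal sketch using the same df (the figsize mirrors the one in your merge-based attempt):
import matplotlib.pyplot as plt
# One column per Class after unstacking; each column is plotted as a line
wide = df.set_index(['Yr', 'Month'])['Cost'].unstack('Class')
wide.plot(figsize=(15, 8))
plt.show()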