Suppose I have a pandas DataFrame with 10 rows and 16 columns. Each row represents one product. The first column is the product ID; the other 15 columns hold the selling price for
2010/01,2010/02,2010/03,2010/05,2010/06,2010/07,2010/08,2010/10,2010/11,2010/12,2011/01,2011/02,2011/03,2011/04,2011/05.
(The column names are strings, not dates.) Now I want to calculate the mean selling price for each quarter (1Q2010, 2Q2010, ..., 2Q2011), but I don't know how to approach it. (Note that the months 2010/04, 2010/09 and 2011/06 are missing.)
The description above is just an example; a data set that small could be handled by hand. However, the real data set I work on is 10730*202, so I cannot manually check which months are missing or map months to quarters by hand. Is there an efficient way to do this?
Thanks for the help!
This should help.
import pandas as pd
import numpy as np
rng = pd.DataFrame({'date': pd.date_range('1/1/2011', periods=72, freq='M'), 'value': np.arange(72)})
# Select 'value' explicitly so the datetime column is not aggregated as well
df = rng.groupby([rng.date.dt.quarter, rng.date.dt.year])[['value']].mean()
df.index.names = ['quarter', 'year']
df.columns = ['mean']
print(df)
              mean
quarter year
1       2011     1
        2012    13
        2013    25
        2014    37
        2015    49
        2016    61
2       2011     4
        2012    16
        2013    28
        2014    40
        2015    52
        2016    64
3       2011     7
        2012    19
        2013    31
        2014    43
        2015    55
        2016    67
4       2011    10
        2012    22
        2013    34
        2014    46
        2015    58
        2016    70
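The example above groups a long table with a real date column, whereas the question has a wide table whose month columns are named with 'YYYY/MM' strings. Here is a minimal sketch of one way to handle that layout directly. The frame, the `product_id` column name and the prices are made up for illustration. It parses the column labels into quarters and averages the month columns that fall in each quarter, so a missing month simply means that quarter is averaged over fewer columns.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the wide table: one row per product,
# month columns named as 'YYYY/MM' strings, with 2010/04 missing.
months = ["2010/01", "2010/02", "2010/03", "2010/05", "2010/06", "2010/07"]
df = pd.DataFrame(np.arange(12, dtype=float).reshape(2, 6), columns=months)
df.insert(0, "product_id", ["A", "B"])

prices = df.set_index("product_id")

# Parse the column labels and map each month to its calendar quarter.
quarters = pd.to_datetime(prices.columns, format="%Y/%m").to_period("Q")

# Transpose so months become rows, group them by quarter label, average,
# then transpose back to one row per product.
quarterly_mean = prices.T.groupby(quarters.astype(str)).mean().T
print(quarterly_mean)
```

Because the quarter labels are derived from the column names, nothing has to be mapped by hand, which should scale to the 10730*202 case.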
I'm using pandas and numpy to play around with a data set on rain measurements in India, but I'm stumped on how to create a particular column. Currently my data set looks like this:
SUBDIVISION                JAN  FEB  MAR  APR  MAY
Andaman & Nicobar Islands   50   70   90  250  430
Arunachal Pradesh           46   90  151  265  356
Assam & Meghalaya           16   31   79  505  340
Bihar                       13   14  100   16   53
What I want is to replace all the month columns with a single column "Months" that contains the name of the month with the most rain, so for example it would look like this:
SUBDIVISION                Months
Andaman & Nicobar Islands  MAY
Arunachal Pradesh          MAY
Assam & Meghalaya          APR
Bihar                      MAR
My data set is much larger than this, so entering the values manually is not practical. I'm hoping there's a way to do this in Python.
Use
# get column name of max values in month columns
df.set_index('SUBDIVISION').idxmax(1).reset_index(name='Months')
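To see what that one-liner does, here is a small self-contained sketch that rebuilds the sample table from the question and applies it. It assumes every column other than SUBDIVISION is a numeric month column.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "SUBDIVISION": ["Andaman & Nicobar Islands", "Arunachal Pradesh",
                    "Assam & Meghalaya", "Bihar"],
    "JAN": [50, 46, 16, 13],
    "FEB": [70, 90, 31, 14],
    "MAR": [90, 151, 79, 100],
    "APR": [250, 265, 505, 16],
    "MAY": [430, 356, 340, 53],
})

# idxmax(axis=1) returns, for each row, the label of the column holding the
# row maximum; reset_index turns the resulting Series back into a DataFrame.
result = df.set_index("SUBDIVISION").idxmax(axis=1).reset_index(name="Months")
print(result)
```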
You can use pd.melt to transform your data first.
import pandas as pd
df = pd.DataFrame({
'subdivision': ['a','b'],
'jun': [1,2],
'july': [2,1]
})
df = pd.melt(df, id_vars=['subdivision'], var_name='month', value_name='rain')
df
df:
subdivision  month  rain
a            jun    1
b            jun    2
a            july   2
b            july   1
Then sort by the rain value in descending order and call drop_duplicates on subdivision, which keeps only the first row (the one with the maximum rain) for each subdivision:
df = df.sort_values('rain', ascending=False).drop_duplicates(['subdivision'])
Output:
subdivision  month  rain
b            jun    2
a            july   2
Hi everyone, I want to calculate the sum of the Violent_type counts for each year. For example, the total count for year 2013 is 18728+121662+1035. But I don't know how to select the data when the index is a MultiIndex. Any advice will be appreciated. Thanks.
The level argument in pandas.DataFrame.groupby() is what you are looking for.
level : int, level name, or sequence of such, default None
    If the axis is a MultiIndex (hierarchical), group by a particular level or levels.
To answer your question, you only need:
df.groupby(level=[0, 1]).sum()
# or
df.groupby(level=['district', 'year']).sum()
To see the effect
import pandas as pd
iterables = [['001', 'SST'], [2013, 2014], ['Dangerous', 'Non-Violent', 'Violent']]
index = pd.MultiIndex.from_product(iterables, names=['district', 'year', 'Violent_type'])
df = pd.DataFrame(list(range(0, len(index))), index=index, columns=['count'])
'''
print(df)
                            count
district year Violent_type
001      2013 Dangerous         0
              Non-Violent       1
              Violent           2
         2014 Dangerous         3
              Non-Violent       4
              Violent           5
SST      2013 Dangerous         6
              Non-Violent       7
              Violent           8
         2014 Dangerous         9
              Non-Violent      10
              Violent          11
'''
print(df.groupby(level=[0, 1]).sum())
'''
               count
district year
001      2013      3
         2014     12
SST      2013     21
         2014     30
'''
print(df.groupby(level=['district', 'year']).sum())
'''
               count
district year
001      2013      3
         2014     12
SST      2013     21
         2014     30
'''
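If the goal is a total per year across all districts, or the total for one specific year, a sketch along the following lines may be closer to what the question asks. It reuses the same made-up frame as above; grouping on the 'year' level alone and using `xs` to select a single year are my suggestions, not part of the original answer.

```python
import pandas as pd

iterables = [["001", "SST"], [2013, 2014], ["Dangerous", "Non-Violent", "Violent"]]
index = pd.MultiIndex.from_product(iterables, names=["district", "year", "Violent_type"])
df = pd.DataFrame({"count": list(range(len(index)))}, index=index)

# Total per year, summed over every district and every Violent_type.
per_year = df.groupby(level="year")["count"].sum()

# Total for one year only: xs() takes a cross-section on the 'year' level.
total_2013 = df.xs(2013, level="year")["count"].sum()

print(per_year)
print(total_2013)
```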
I have a data frame that contains daily data for the last five years. Besides the value columns, the data frame also contains a date field and a regulatory year column. I want to create two columns: the regulatory week number and the regulatory month number. The regulatory year starts on 1 April and ends on 31 March. So I used the following code to generate the regulatory week number and month number:
iso_week = df['date'].dt.isocalendar().week
df['Week'] = np.where(iso_week > 13, iso_week - 13, iso_week + 39)
# Use abbreviated month names so they match the categories below
df['month'] = df['date'].dt.strftime('%b')
months = ['Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec','Jan','Feb','Mar']
df['month'] = pd.Categorical(df['month'], ordered=True, categories=months)
df['month number'] = df['month'].apply(lambda x: months.index(x)+1)
After creating these two columns, my data frame looks as follows:
RY month Week Value 1 Value 2 Value 3 Value 4 month number
2016 Apr 1 0.00000 0.00000 0.000000 0.00000 1
2016 Apr 2 1.31394 0.02961 1.313940 0.02961 1
2016 Apr 3 4.98354 0.07146 4.983540 0.07146 1
2016 Apr 4 4.30606 0.05742 4.306060 0.05742 1
2016 Apr 5 1.94634 0.01958 1.946340 0.01958 1
2016 May 5 0.25342 0.01625 0.253420 0.01625 2
2016 May 6 0.64051 0.00777 0.640510 0.00777 2
2016 May 7 1.26451 0.02994 1.264510 0.02994 2
2016 May 8 2.71035 0.08150 2.194947 0.08150 2
2016 May 9 11.95120 0.13386 1.624328 0.13386 2
2016 Jun 10 6.93051 0.08126 6.930510 0.08126 3
2016 Jun 11 1.18872 0.03953 1.188720 0.03953 3
2016 Jun 12 3.19961 0.05760 0.924562 0.05760 3
2016 Jun 13 3.90429 0.04985 0.956445 0.04985 3
2016 Jun 14 0.84002 0.01738 0.840020 0.01738 3
2016 Jul 14 0.07358 0.00562 0.073580 0.00562 4
2016 Jul 15 0.78253 0.03014 0.782530 0.03014 4
2016 Jul 16 1.23036 0.01816 1.230360 0.01816 4
2016 Jul 17 0.62948 0.01341 0.629480 0.01341 4
2016 Jul 18 0.45513 0.00552 0.455130 0.00552 4
Now I want to create a data frame that contains the mean of the value columns for each Week. So I used the following command to calculate the mean:
mean_df = df.groupby('Week')[['Value1','Value2','Value3','Value4']].mean().reset_index()
The new data frame looks as follows:
Week Value 1 Value 2 Value 3 Value 4
1 3.013490 0.039740 1.348016 0.039740
2 3.094456 0.045142 3.094456 0.045142
3 1.615948 0.027216 1.615948 0.027216
4 2.889245 0.043998 1.903319 0.043998
5 0.431549 0.009679 0.431549 0.009679
6 1.045670 0.017302 1.045670 0.017302
7 2.444196 0.034304 2.444196 0.034304
8 1.041210 0.026464 0.938129 0.026464
9 2.068607 0.030550 0.921176 0.030550
10 2.400118 0.051476 2.400118 0.051476
11 1.738332 0.035362 1.738332 0.035362
12 1.369790 0.038576 0.914780 0.038576
13 1.921781 0.021218 0.749460 0.021218
14 1.471432 0.027367 1.471432 0.027367
15 2.722526 0.053794 1.676559 0.053794
16 3.132406 0.043520 1.195321 0.043520
17 0.733952 0.021142 0.733952 0.021142
18 0.645236 0.014454 0.645236 0.014454
19 2.466326 0.049704 0.879481 0.049704
20 2.111326 0.013262 0.682253 0.013262
21 1.301004 0.023048 1.301004 0.023048
22 0.705360 0.023439 0.705360 0.023439
23 1.323438 0.019103 1.323438 0.019103
24 0.569906 0.012540 0.569906 0.012540
25 7.898792 0.034246 1.382349 0.034246
26 0.896413 0.013013 0.896413 0.013013
27 4.478349 0.039749 1.703887 0.039749
28 5.807160 0.052526 2.036502 0.052526
29 3.308176 0.043984 2.117939 0.043984
30 1.991078 0.046058 1.991078 0.046058
31 0.806589 0.016945 0.806589 0.016945
32 2.091860 0.029234 2.091860 0.029234
33 1.149280 0.025194 1.149280 0.025194
34 4.746376 0.067742 2.863484 0.067742
35 5.128558 0.029608 1.537541 0.029608
36 2.765563 0.052125 2.765563 0.052125
37 2.314376 0.036046 2.314376 0.036046
38 2.552290 0.030626 1.483397 0.030626
39 1.456778 0.037448 1.456778 0.037448
40 1.212090 0.024698 1.212090 0.024698
41 4.729104 0.037646 1.296358 0.037646
42 3.412830 0.053132 3.412830 0.053132
43 8.916526 0.050044 1.839411 0.050044
44 2.450281 0.029806 0.942205 0.029806
45 2.156186 0.024064 2.156186 0.024064
46 2.336330 0.042538 2.336330 0.042538
47 1.798326 0.025270 1.798326 0.025270
48 1.352004 0.018382 1.352004 0.018382
49 10.220510 0.073480 1.607830 0.073480
50 2.575344 0.047760 2.575344 0.047760
51 1.226056 0.028676 1.226056 0.028676
52 0.470392 0.009991 0.466561 0.009991
Now I want to add the month and month number from the first data frame to the new data frame. I thought of merging the two data frames on 'Week', but I found that the same week number is assigned to two different months in the first data frame. For example, Week 5 is assigned to both April and May.
Ideally, a week number would be assigned to only one month. I am not sure whether I am calculating the week number correctly. Has anyone come across the same problem? Any advice on how to calculate the week number so that a week does not overlap two months?
Presumably, week 5 contains some days in April and some in May. So it's not possible to assign week 5 (as a whole) to a single month.
Perhaps you could assign the month in which the first day of the week falls?
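A minimal sketch of that suggestion, assuming ISO weeks that start on Monday and a 'date' column like the one in the question. The `week_start` and `week_month` column names are made up for illustration. Every date is mapped to the Monday of its week, and the month is taken from that Monday, so all days of one week share a single month.

```python
import pandas as pd

# Two hypothetical weeks straddling the April/May boundary in 2016.
df = pd.DataFrame({"date": pd.date_range("2016-04-25", "2016-05-08", freq="D")})

# dt.weekday is 0 for Monday, so subtracting it in days gives the Monday
# that starts the week containing each date.
df["week_start"] = df["date"] - pd.to_timedelta(df["date"].dt.weekday, unit="D")

# Assign each week to the calendar month of its first day.
df["week_month"] = df["week_start"].dt.month
print(df)
```

With this rule Sunday 1 May 2016 is counted in April, because its week starts on Monday 25 April. Whether that is acceptable depends on how the regulatory reporting defines a week.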
Suppose I have the following dataframe:
year count
2001 14
2004 16
2001 2
2005 21
2001 22
2004 14
2001 8
I want to group by the year column and add the count column for each given year. I would like my result to be
year count
2001 46
2004 30
2005 21
I am struggling a bit finding a way to do this, can anyone help?
import pandas as pd
df = pd.read_csv("test.csv")
df['count'] = pd.to_numeric(df['count'])
#df['count'] = df.groupby(['year'])['count'].sum()
total = df.groupby(['year'])['count'].sum()
print(total)
Yields:
year
2001 46
2004 30
2005 21
Hope this helps!
Assuming your pandas DataFrame is named df, the groupby call looks like this:
df.groupby('year')[['count']].sum()
It returns the DataFrame you want, with year as the index.
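For a version that runs without a CSV file, here is a short sketch that builds the sample data from the question inline. Passing `as_index=False` is my addition: it keeps year as an ordinary column, which matches the two-column layout shown in the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "year": [2001, 2004, 2001, 2005, 2001, 2004, 2001],
    "count": [14, 16, 2, 21, 22, 14, 8],
})

# Sum 'count' within each year; as_index=False keeps 'year' as a column.
totals = df.groupby("year", as_index=False)["count"].sum()
print(totals)
```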
I have a dataframe for which I'm looking at histograms of subsets of the data using column and by of pandas' hist() method, as in:
ax = df.hist(column='activity_count', by='activity_month')
(then I go along and plot this info). I'm trying to determine how to programmatically pull out two pieces of data: the number of records with that particular value of 'activity_month' as well as the value of 'activity_month' when I loop over the axes:
for i, x in enumerate(ax):
    print("the value of a is", a)
    print("the number of rows with value of a", b)
so that I'd get:
January 1002
February 4305
etc
Now, I can easily get the list of unique values of "activity_month", as well as a count of how many rows have a given value of activity_month equal to that,
a="January"
len(df[df["activity_month"] == a])
but I'd like to do that within the loop, for a particular iteration of i,x. How do I get a handle on the subsetted data within "x" on each iteration so I can look at the value of the "activity_month" and the number of rows with that value on that iteration?
Here is a short example dataframe:
import pandas as pd
df = pd.DataFrame([['January',19],['March',6],['January',24],['November',83],['February',23],
                   ['November',4],['February',98],['January',44],['October',47],['January',4],
                   ['April',8],['March',21],['April',41],['June',34],['March',63]],
                  columns=['activity_month','activity_count'])
Yields:
activity_month activity_count
0 January 19
1 March 6
2 January 24
3 November 83
4 February 23
5 November 4
6 February 98
7 January 44
8 October 47
9 January 4
10 April 8
11 March 21
12 April 41
13 June 34
14 March 63
If you want the sum of the values for each group from your df.groupby('activity_month'), then this will do:
df.groupby('activity_month')['activity_count'].sum()
Gives:
activity_month
April 49
February 121
January 91
June 34
March 90
November 87
October 47
Name: activity_count, dtype: int64
To get the number of rows that correspond to a given group:
df.groupby('activity_month')['activity_count'].agg('count')
Gives:
activity_month
April 2
February 2
January 4
June 1
March 3
November 2
October 1
Name: activity_count, dtype: int64
After re-reading your question, I'm convinced that you are not approaching this problem in the most efficient manner. I would highly recommend that you do not explicitly loop through the axes you have created with df.hist(), especially when this information is quickly (and directly) accessible from df itself.
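If you still want a loop that yields both the month and its row count, iterating over the groupby object itself is one option. This is a sketch using the example frame above; the `summary` name is only illustrative. By default groupby iterates over the keys in sorted order, so if you pair these results with the histogram axes, check them against the subplot titles rather than assuming the order matches.

```python
import pandas as pd

df = pd.DataFrame([['January', 19], ['March', 6], ['January', 24], ['November', 83],
                   ['February', 23], ['November', 4], ['February', 98], ['January', 44],
                   ['October', 47], ['January', 4], ['April', 8], ['March', 21],
                   ['April', 41], ['June', 34], ['March', 63]],
                  columns=['activity_month', 'activity_count'])

# Each iteration yields the group key (the month) and the sub-DataFrame
# holding only the rows for that month.
summary = []
for month, group in df.groupby('activity_month'):
    summary.append((month, len(group)))
    print(month, len(group))
```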