Python pandas: Calculated week number overlaps with two months

I have a data frame containing daily data for the last five years. Besides the value columns, the data frame also contains a date field and a regulatory-year column. I wanted to create two columns: the regulatory week number and the regulatory month number. The regulatory year starts on the 1st of April and ends on the 31st of March, so I used the following code to generate the regulatory week number and month number:
week = df['date'].dt.isocalendar().week
df['Week'] = np.where(week > 13, week - 13, week + 39)
df['month'] = df['date'].dt.strftime('%b')
months = ['Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec','Jan','Feb','Mar']
df['month'] = pd.Categorical(df['month'], ordered=True, categories=months)
df['month number'] = df['month'].apply(lambda x: months.index(x) + 1)
After creating the two columns, my data frame looks as follows:
RY month Week Value1 Value2 Value3 Value4 month number
2016 Apr 1 0.00000 0.00000 0.000000 0.00000 1
2016 Apr 2 1.31394 0.02961 1.313940 0.02961 1
2016 Apr 3 4.98354 0.07146 4.983540 0.07146 1
2016 Apr 4 4.30606 0.05742 4.306060 0.05742 1
2016 Apr 5 1.94634 0.01958 1.946340 0.01958 1
2016 May 5 0.25342 0.01625 0.253420 0.01625 2
2016 May 6 0.64051 0.00777 0.640510 0.00777 2
2016 May 7 1.26451 0.02994 1.264510 0.02994 2
2016 May 8 2.71035 0.08150 2.194947 0.08150 2
2016 May 9 11.95120 0.13386 1.624328 0.13386 2
2016 Jun 10 6.93051 0.08126 6.930510 0.08126 3
2016 Jun 11 1.18872 0.03953 1.188720 0.03953 3
2016 Jun 12 3.19961 0.05760 0.924562 0.05760 3
2016 Jun 13 3.90429 0.04985 0.956445 0.04985 3
2016 Jun 14 0.84002 0.01738 0.840020 0.01738 3
2016 Jul 14 0.07358 0.00562 0.073580 0.00562 4
2016 Jul 15 0.78253 0.03014 0.782530 0.03014 4
2016 Jul 16 1.23036 0.01816 1.230360 0.01816 4
2016 Jul 17 0.62948 0.01341 0.629480 0.01341 4
2016 Jul 18 0.45513 0.00552 0.455130 0.00552 4
Now I want to create a data frame that contains the mean of the value columns grouped by Week, so I used the following command:
mean_df = df.groupby('Week')[['Value1','Value2','Value3','Value4']].mean().reset_index()
The new data frame looks as follows:
Week Value1 Value2 Value3 Value4
1 3.013490 0.039740 1.348016 0.039740
2 3.094456 0.045142 3.094456 0.045142
3 1.615948 0.027216 1.615948 0.027216
4 2.889245 0.043998 1.903319 0.043998
5 0.431549 0.009679 0.431549 0.009679
6 1.045670 0.017302 1.045670 0.017302
7 2.444196 0.034304 2.444196 0.034304
8 1.041210 0.026464 0.938129 0.026464
9 2.068607 0.030550 0.921176 0.030550
10 2.400118 0.051476 2.400118 0.051476
11 1.738332 0.035362 1.738332 0.035362
12 1.369790 0.038576 0.914780 0.038576
13 1.921781 0.021218 0.749460 0.021218
14 1.471432 0.027367 1.471432 0.027367
15 2.722526 0.053794 1.676559 0.053794
16 3.132406 0.043520 1.195321 0.043520
17 0.733952 0.021142 0.733952 0.021142
18 0.645236 0.014454 0.645236 0.014454
19 2.466326 0.049704 0.879481 0.049704
20 2.111326 0.013262 0.682253 0.013262
21 1.301004 0.023048 1.301004 0.023048
22 0.705360 0.023439 0.705360 0.023439
23 1.323438 0.019103 1.323438 0.019103
24 0.569906 0.012540 0.569906 0.012540
25 7.898792 0.034246 1.382349 0.034246
26 0.896413 0.013013 0.896413 0.013013
27 4.478349 0.039749 1.703887 0.039749
28 5.807160 0.052526 2.036502 0.052526
29 3.308176 0.043984 2.117939 0.043984
30 1.991078 0.046058 1.991078 0.046058
31 0.806589 0.016945 0.806589 0.016945
32 2.091860 0.029234 2.091860 0.029234
33 1.149280 0.025194 1.149280 0.025194
34 4.746376 0.067742 2.863484 0.067742
35 5.128558 0.029608 1.537541 0.029608
36 2.765563 0.052125 2.765563 0.052125
37 2.314376 0.036046 2.314376 0.036046
38 2.552290 0.030626 1.483397 0.030626
39 1.456778 0.037448 1.456778 0.037448
40 1.212090 0.024698 1.212090 0.024698
41 4.729104 0.037646 1.296358 0.037646
42 3.412830 0.053132 3.412830 0.053132
43 8.916526 0.050044 1.839411 0.050044
44 2.450281 0.029806 0.942205 0.029806
45 2.156186 0.024064 2.156186 0.024064
46 2.336330 0.042538 2.336330 0.042538
47 1.798326 0.025270 1.798326 0.025270
48 1.352004 0.018382 1.352004 0.018382
49 10.220510 0.073480 1.607830 0.073480
50 2.575344 0.047760 2.575344 0.047760
51 1.226056 0.028676 1.226056 0.028676
52 0.470392 0.009991 0.466561 0.009991
Now I want to bring the month name and month number from the first data frame into the new one. I thought of merging the two data frames on 'Week', but I found that the same week number is assigned to two different months in the first data frame. For example, week 5 is assigned to both April and May.
Ideally, a week number would be assigned to only one month, so I am not sure whether I am calculating the week number in the right manner. Has anyone come across the same problem? Any advice on how to calculate the week number so that a week number does not overlap two months?

Presumably, week 5 contains some days in April and some in May, so it's not possible to assign week 5 (as a whole) to a single month.
Perhaps you could assign each week to the month in which its first day falls?
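A minimal sketch of that suggestion, assuming df['date'] is a datetime column and reusing the question's column names ('week_month' and 'week_to_month' are names introduced here; note that across several regulatory years the same week number can still start in different months):
import pandas as pd

# Monday of each row's ISO week (isocalendar().day is 1 for Monday)
iso = df['date'].dt.isocalendar()
week_start = df['date'] - pd.to_timedelta(iso['day'].astype(int) - 1, unit='D')

# Assign the whole week to the month its first day falls in
df['week_month'] = week_start.dt.strftime('%b')

# Each week number now maps to one month, so the merge is unambiguous
week_to_month = df[['Week', 'week_month']].drop_duplicates()
mean_df = mean_df.merge(week_to_month, on='Week')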

Related

loop to filter rows based on multiple column conditions pandas python

df
month year Jelly Candy Ice_cream.....and so on
JAN 2010 12 11 10
FEB 2010 13 1 2
MAR 2010 12 2 1
....
DEC 2019 2 3 4
I have code to extract the data frames where the month name is JAN, FEB, etc. for all years. For example:
[IN]filterJan=df[df['month']=='JAN']
filterJan
[OUT]
month year Jelly Candy Ice_cream.....and so on
JAN 2010 12 11 10
JAN 2011 13 1 2
....
JAN 2019 2 3 4
I am trying to make a loop for this process.
[IN] for month in ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']:
         filter[month] = df[df['month']==month]
[OUT]
----> 3 filter[month]=batch1_clean_Sales_database[batch1_clean_Sales_database['month']==month]
TypeError: 'type' object does not support item assignment
If I print the data frames it works, but I want to store them and reuse them later:
[IN] for month in ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']:
         print(df[df['month']==month])
I think you can create a dictionary of DataFrames:
d = dict(tuple(df.groupby('month')))
Your solution should be changed to:
d = {}
for month in ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']:
    d[month] = df[df['month']==month]
Then it is possible to select each month like d['Jan'], which works like a regular DataFrame.
If you want to loop over the dictionary of DataFrames:
for k, v in d.items():
    print(k)
    print(v)
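A self-contained toy run of that pattern (the numbers here are invented for illustration):
import pandas as pd

df = pd.DataFrame({'month': ['JAN', 'FEB', 'JAN', 'MAR'],
                   'year': [2010, 2010, 2011, 2011],
                   'Jelly': [12, 13, 13, 12]})

d = dict(tuple(df.groupby('month')))  # one DataFrame per month name
print(d['JAN'])                       # all JAN rows, across years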

Pandas get monthly open close from daily data?

I have around 700 rows of data starting from Jan 2010.
I am trying to find the monthly movement, i.e. the first recorded open for a month minus the last recorded close for that month.
Groupby allows for sum() and mean(), but I can't figure out how to get those two data points.
df
         Date     Open    Close
0 2010-04-01 9464.15 9507.75
1 2010-04-05 9593.55 9698.60
2 2010-04-06 9732.60 9728.20
3 2010-04-07 9778.50 9681.05
4 2010-04-08 9676.70 9520.00
5 2010-04-09 9538.00 9656.50
6 2010-04-12 9661.20 9575.45
7 2010-04-13 9565.05 9483.00
8 2010-04-15 9501.60 9344.60
9 2010-04-16 9345.50 9353.75
10 2010-04-19 9273.85 9302.90
11 2010-04-20 9314.55 9446.10
12 2010-04-21 9477.10 9555.30
13 2010-04-22 9534.05 9623.25
14 2010-04-23 9653.15 9813.30
15 2010-04-26 9890.80 9839.15
16 2010-04-27 9827.00 9756.45
17 2010-04-28 9630.35 9634.90
18 2010-04-29 9652.60 9803.80
19 2010-04-30 9809.40 9870.35
20 2010-05-03 9809.40 9775.50
21 2010-05-04 9816.60 9623.70
22 2010-05-05 9461.35 9581.85
23 2010-05-06 9588.85 9582.00
24 2010-05-07 9426.65 9276.10
25 2010-05-10 9419.50 9656.25
26 2010-05-11 9683.20 9626.10
27 2010-05-12 9640.80 9722.20
28 2010-05-13 9788.35 9773.35
29 2010-05-14 9738.15 9589.05
Desired output
df
Date Open Close
0 2010-APR 9464.15 9634.90 # Close is from 2010-04-30
1 2010-MAY 9809.40 9589.05 # Close is from 2010-05-14
It would be great to have two more columns such as Open Date and Close Date.
I think this will do:
df["Date"] = pd.to_datetime(df["Date"])
# Group by year and month so that e.g. April 2010 and April 2011 stay separate
gb = df.groupby([df.Date.dt.year, df.Date.dt.month])
pd.DataFrame({'Open': gb.Open.first(), 'Close': gb.Close.last()})
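For the extra Open Date / Close Date columns the asker mentions, one possible extension of the same groupby (a sketch, not tested against the real data):
pd.DataFrame({
    'Open Date':  gb.Date.first(),   # first trading day of the month
    'Open':       gb.Open.first(),
    'Close Date': gb.Date.last(),    # last trading day of the month
    'Close':      gb.Close.last(),
})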

pandas histogram: extracting column and group by from data

I have a dataframe for which I'm looking at histograms of subsets of the data using column and by of pandas' hist() method, as in:
ax = df.hist(column='activity_count', by='activity_month')
(then I go along and plot this info). I'm trying to determine how to programmatically pull out two pieces of data: the number of records with that particular value of 'activity_month' as well as the value of 'activity_month' when I loop over the axes:
for i, x in enumerate(ax):
    print("the value of a is", a)
    print("the number of rows with value of a", b)
so that I'd get:
January 1002
February 4305
etc
Now, I can easily get the list of unique values of "activity_month", as well as a count of how many rows have activity_month equal to a given value:
a = "January"
len(df[df["activity_month"]==a])
but I'd like to do that within the loop, for a particular iteration of i,x. How do I get a handle on the subsetted data within "x" on each iteration so I can look at the value of the "activity_month" and the number of rows with that value on that iteration?
Here is a short example dataframe:
import pandas as pd
df = pd.DataFrame([['January',19],['March',6],['January',24],['November',83],['February',23],
['November',4],['February',98],['January',44],['October',47],['January',4],
['April',8],['March',21],['April',41],['June',34],['March',63]],
columns=['activity_month','activity_count'])
Yields:
activity_month activity_count
0 January 19
1 March 6
2 January 24
3 November 83
4 February 23
5 November 4
6 February 98
7 January 44
8 October 47
9 January 4
10 April 8
11 March 21
12 April 41
13 June 34
14 March 63
If you want the sum of the values for each group from your df.groupby('activity_month'), then this will do:
df.groupby('activity_month')['activity_count'].sum()
Gives:
activity_month
April 49
February 121
January 91
June 34
March 90
November 87
October 47
Name: activity_count, dtype: int64
To get the number of rows that correspond to a given group:
df.groupby('activity_month')['activity_count'].agg('count')
Gives:
activity_month
April 2
February 2
January 4
June 1
March 3
November 2
October 1
Name: activity_count, dtype: int64
After re-reading your question, I'm convinced that you are not approaching this problem in the most efficient manner. I would highly recommend that you do not explicitly loop through the axes you have created with df.hist(), especially when this information is quickly (and directly) accessible from df itself.
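For instance, a per-month row count can come straight from the column, without touching the histogram axes (a short sketch using the example frame above):
for month, n in df['activity_month'].value_counts().items():
    print(month, n)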

Calculate average for each quarter given month columns

Suppose I have a Python Pandas dataframe with 10 rows and 16 columns. Each row stands for one product, the first column is the product ID, and the other 15 columns are the selling prices for
2010/01, 2010/02, 2010/03, 2010/05, 2010/06, 2010/07, 2010/08, 2010/10, 2010/11, 2010/12, 2011/01, 2011/02, 2011/03, 2011/04, 2011/05.
(The column names are strings, not dates.) Now I want to calculate the mean selling price for each quarter (1Q2010, 2Q2010, ..., 2Q2011), and I don't know how to deal with it. (Note that the months 2010/04, 2010/09, and 2011/06 are missing.)
The description above is just an example; because that data set is small, it would be possible to handle it with a manual loop. However, the real data set I work on is 10730×202, so I cannot manually check which months are missing or map months to quarters by hand. I wonder what efficient approach I can apply here.
Thanks for the help!
This should help.
import pandas as pd
import numpy as np
rng = pd.DataFrame({'date': pd.date_range('1/1/2011', periods=72, freq='M'), 'value': np.arange(72)})
df = rng.groupby([rng.date.dt.quarter, rng.date.dt.year])['value'].mean().to_frame()
df.index.names = ['quarter', 'year']
df.columns = ['mean']
print(df)
mean
quarter year
1 2011 1
2012 13
2013 25
2014 37
2015 49
2016 61
2 2011 4
2012 16
2013 28
2014 40
2015 52
2016 64
3 2011 7
2012 19
2013 31
2014 43
2015 55
2016 67
4 2011 10
2012 22
2013 34
2014 46
2015 58
2016 70
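The example above groups a long date column, but the question actually has the months as wide string columns. A hedged sketch for that shape, assuming labels like '2010/01' (the frame and its values here are stand-ins):
import pandas as pd

# Wide frame: one row per product, one column per month; 2010/04 is missing, as in the question
df = pd.DataFrame({'product_id': [1, 2],
                   '2010/01': [10.0, 20.0],
                   '2010/02': [12.0, 22.0],
                   '2010/03': [11.0, 21.0],
                   '2010/05': [14.0, 24.0]})

prices = df.set_index('product_id')
# Parse the column labels into quarterly periods; missing months simply never appear
quarters = pd.PeriodIndex(pd.to_datetime(prices.columns, format='%Y/%m'), freq='Q')
# Mean per quarter, averaging only over the months actually present
quarterly_mean = prices.T.groupby(quarters).mean().T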

Group Daily Data by Week for Python Dataframe

So I have a Python dataframe that is sorted by month and then by day,
In [4]: result_GB_daily_average
Out[4]:
NREL Avert
Month Day
1 1 14.718417 37.250000
2 40.381167 45.250000
3 42.512646 40.666667
4 12.166896 31.583333
5 14.583208 50.416667
6 34.238000 45.333333
7 45.581229 29.125000
8 60.548479 27.916667
9 48.061583 34.041667
10 20.606958 37.583333
11 5.418833 70.833333
12 51.261375 43.208333
13 21.796771 42.541667
14 27.118979 41.958333
15 8.230542 43.625000
16 14.233958 48.708333
17 28.345875 51.125000
18 43.896375 55.500000
19 95.800542 44.500000
20 53.763104 39.958333
21 26.171437 50.958333
22 20.372688 66.916667
23 20.594042 42.541667
24 16.889083 48.083333
25 16.416479 42.125000
26 28.459625 40.125000
27 1.055229 49.833333
28 36.798792 42.791667
29 27.260083 47.041667
30 23.584917 55.750000
This continues on for every month of the year, and I would like to group it by week instead of day, so that it looks something like this:
In [4]: result_GB_week_average
Out[4]:
NREL Avert
Month Week
1 1 Average values from first 7 days
2 Average values from next 7 days
3 Average values from next 7 days
4 Average values from next 7 days
And so forth. What would the easiest way to do this be?
I assume by weeks you don't mean actual calendar weeks! Here is my proposed solution:
#First add a dummy column
result_GB_daily_average['count'] = 1
#Then turn the cumulative sum into consecutive 7-day blocks (days 1-7 -> week 1, days 8-14 -> week 2, ...)
result_GB_daily_average['Week'] = (result_GB_daily_average['count'].cumsum() - 1) // 7 + 1
#Then do the group by and calculate the average
result_GB_week_average = result_GB_daily_average.groupby('Week')[['NREL', 'Avert']].mean()
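Equivalently, a shorter variant that skips the dummy column (a sketch, assuming the frame keeps its row order):
import numpy as np

# 0-based row position // 7 gives consecutive 7-day blocks
week = np.arange(len(result_GB_daily_average)) // 7 + 1
result_GB_week_average = result_GB_daily_average.groupby(week)[['NREL', 'Avert']].mean()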
