Pandas get monthly open close from daily data? - python

I have around 700 rows with data starting from Jan 2010.
I am trying to find the monthly movement i.e. 1st recorded data open for a month minus the last recorded data close for that month.
Groupby allows for sum() and mean() but I can't figure out how to get the above mentioned two data points.
df
Date Open Close
0 2010-04-01 9464.15 9507.75
1 2010-04-05 9593.55 9698.60
2 2010-04-06 9732.60 9728.20
3 2010-04-07 9778.50 9681.05
4 2010-04-08 9676.70 9520.00
5 2010-04-09 9538.00 9656.50
6 2010-04-12 9661.20 9575.45
7 2010-04-13 9565.05 9483.00
8 2010-04-15 9501.60 9344.60
9 2010-04-16 9345.50 9353.75
10 2010-04-19 9273.85 9302.90
11 2010-04-20 9314.55 9446.10
12 2010-04-21 9477.10 9555.30
13 2010-04-22 9534.05 9623.25
14 2010-04-23 9653.15 9813.30
15 2010-04-26 9890.80 9839.15
16 2010-04-27 9827.00 9756.45
17 2010-04-28 9630.35 9634.90
18 2010-04-29 9652.60 9803.80
19 2010-04-30 9809.40 9870.35
20 2010-05-03 9809.40 9775.50
21 2010-05-04 9816.60 9623.70
22 2010-05-05 9461.35 9581.85
23 2010-05-06 9588.85 9582.00
24 2010-05-07 9426.65 9276.10
25 2010-05-10 9419.50 9656.25
26 2010-05-11 9683.20 9626.10
27 2010-05-12 9640.80 9722.20
28 2010-05-13 9788.35 9773.35
29 2010-05-14 9738.15 9589.05
Desired output
df
Date Open Close
0 2010-APR 9464.15 9634.90 # Close, is from 2010-04-30
1 2010-MAY 9809.40 9589.05 # Close, is from 2010-05-14
It would be great to have two more columns such as Open Date and Close Date.

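I think this will do:
df["Date"] = pd.to_datetime(df["Date"])
# group by year and month together so that the same month in different years is not merged
gb = df.groupby(df.Date.dt.to_period('M'))
# first()/last() return one value per group, indexed by the group key
pd.DataFrame({'Open': gb.Open.first(), 'Close': gb.Close.last()})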
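To also get the Open Date and Close Date columns asked for in the question, named aggregation on the same grouping is one option. This is only a sketch: the five sample rows are copied from the question's data, while the column names Open_Date, Close_Date and Movement are made up for illustration, and the rows are assumed to be sortable by Date.

```python
import pandas as pd

# A few rows taken from the question's data (April and May 2010)
df = pd.DataFrame({
    "Date": ["2010-04-01", "2010-04-05", "2010-04-30", "2010-05-03", "2010-05-14"],
    "Open": [9464.15, 9593.55, 9809.40, 9809.40, 9738.15],
    "Close": [9507.75, 9698.60, 9870.35, 9775.50, 9589.05],
})
df["Date"] = pd.to_datetime(df["Date"])
df = df.sort_values("Date")  # first/last below rely on chronological order

# One row per calendar month (year and month together)
monthly = df.groupby(df["Date"].dt.to_period("M")).agg(
    Open=("Open", "first"),
    Close=("Close", "last"),
    Open_Date=("Date", "first"),   # hypothetical column name
    Close_Date=("Date", "last"),   # hypothetical column name
)

# Monthly movement as defined in the question: first open minus last close
monthly["Movement"] = monthly["Open"] - monthly["Close"]
print(monthly)
```

Grouping on a monthly period rather than on dt.month keeps April 2010 and April 2011 apart, which matters once the data covers more than one year.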

Related

Python pandas: Calculated Week number overlaps with two months

I have a data frame that contains daily data for the last five years. Besides the value columns, the data frame also contains a date field and a regulatory year column. I want to create two columns: the regulatory week number and the regulatory month number. The regulatory year starts on 1st April and ends on 31st March. So I used the following code to generate the regulatory week number and month number:
df['Week'] = np.where(df['date'].dt.isocalendar().week > 13, df['date'].dt.isocalendar().week-13,df['date'].dt.isocalendar().week + 39)
df['month'] = df['date'].dt.strftime('%b')  # abbreviated month name, so it matches the categories below
months = ['Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec','Jan','Feb','Mar']
df['month'] = pd.CategoricalIndex(df['month'], ordered=True, categories=months)
df['month number'] = df['month'].apply(lambda x: months.index(x)+1)
After creating the above-mentioned two columns, my data frame looks as follows:
RY month Week Value 1 Value 2 Value 3 Value 4 month number
2016 Apr 1 0.00000 0.00000 0.000000 0.00000 1
2016 Apr 2 1.31394 0.02961 1.313940 0.02961 1
2016 Apr 3 4.98354 0.07146 4.983540 0.07146 1
2016 Apr 4 4.30606 0.05742 4.306060 0.05742 1
2016 Apr 5 1.94634 0.01958 1.946340 0.01958 1
2016 May 5 0.25342 0.01625 0.253420 0.01625 2
2016 May 6 0.64051 0.00777 0.640510 0.00777 2
2016 May 7 1.26451 0.02994 1.264510 0.02994 2
2016 May 8 2.71035 0.08150 2.194947 0.08150 2
2016 May 9 11.95120 0.13386 1.624328 0.13386 2
2016 Jun 10 6.93051 0.08126 6.930510 0.08126 3
2016 Jun 11 1.18872 0.03953 1.188720 0.03953 3
2016 Jun 12 3.19961 0.05760 0.924562 0.05760 3
2016 Jun 13 3.90429 0.04985 0.956445 0.04985 3
2016 Jun 14 0.84002 0.01738 0.840020 0.01738 3
2016 Jul 14 0.07358 0.00562 0.073580 0.00562 4
2016 Jul 15 0.78253 0.03014 0.782530 0.03014 4
2016 Jul 16 1.23036 0.01816 1.230360 0.01816 4
2016 Jul 17 0.62948 0.01341 0.629480 0.01341 4
2016 Jul 18 0.45513 0.00552 0.455130 0.00552 4
Now I want to create a data frame that contains the mean of the value columns for each Week. So I used the following command to calculate the mean:
mean_df = df.groupby('Week')[['Value 1', 'Value 2', 'Value 3', 'Value 4']].mean().reset_index()
The new data frame looks as follows:
Week Value 1 Value 2 Value 3 Value 4
1 3.013490 0.039740 1.348016 0.039740
2 3.094456 0.045142 3.094456 0.045142
3 1.615948 0.027216 1.615948 0.027216
4 2.889245 0.043998 1.903319 0.043998
5 0.431549 0.009679 0.431549 0.009679
6 1.045670 0.017302 1.045670 0.017302
7 2.444196 0.034304 2.444196 0.034304
8 1.041210 0.026464 0.938129 0.026464
9 2.068607 0.030550 0.921176 0.030550
10 2.400118 0.051476 2.400118 0.051476
11 1.738332 0.035362 1.738332 0.035362
12 1.369790 0.038576 0.914780 0.038576
13 1.921781 0.021218 0.749460 0.021218
14 1.471432 0.027367 1.471432 0.027367
15 2.722526 0.053794 1.676559 0.053794
16 3.132406 0.043520 1.195321 0.043520
17 0.733952 0.021142 0.733952 0.021142
18 0.645236 0.014454 0.645236 0.014454
19 2.466326 0.049704 0.879481 0.049704
20 2.111326 0.013262 0.682253 0.013262
21 1.301004 0.023048 1.301004 0.023048
22 0.705360 0.023439 0.705360 0.023439
23 1.323438 0.019103 1.323438 0.019103
24 0.569906 0.012540 0.569906 0.012540
25 7.898792 0.034246 1.382349 0.034246
26 0.896413 0.013013 0.896413 0.013013
27 4.478349 0.039749 1.703887 0.039749
28 5.807160 0.052526 2.036502 0.052526
29 3.308176 0.043984 2.117939 0.043984
30 1.991078 0.046058 1.991078 0.046058
31 0.806589 0.016945 0.806589 0.016945
32 2.091860 0.029234 2.091860 0.029234
33 1.149280 0.025194 1.149280 0.025194
34 4.746376 0.067742 2.863484 0.067742
35 5.128558 0.029608 1.537541 0.029608
36 2.765563 0.052125 2.765563 0.052125
37 2.314376 0.036046 2.314376 0.036046
38 2.552290 0.030626 1.483397 0.030626
39 1.456778 0.037448 1.456778 0.037448
40 1.212090 0.024698 1.212090 0.024698
41 4.729104 0.037646 1.296358 0.037646
42 3.412830 0.053132 3.412830 0.053132
43 8.916526 0.050044 1.839411 0.050044
44 2.450281 0.029806 0.942205 0.029806
45 2.156186 0.024064 2.156186 0.024064
46 2.336330 0.042538 2.336330 0.042538
47 1.798326 0.025270 1.798326 0.025270
48 1.352004 0.018382 1.352004 0.018382
49 10.220510 0.073480 1.607830 0.073480
50 2.575344 0.047760 2.575344 0.047760
51 1.226056 0.028676 1.226056 0.028676
52 0.470392 0.009991 0.466561 0.009991
Now I want to insert the month and month name from the above data frame to the new data frame. I thought to merge the two data frames together based on 'Week' but I found that the same week number is assigned to the two different months (in the first data frame). For example, Week 5 is assigned to April and May.
Ideally, a week number is assigned to only one month. I am not sure whether I am calculating the week number correctly. Has anyone come across the same problem? Any advice on how to calculate the week number so that a week does not overlap two months?
Presumably, week 5 contains some days in April and some in May. So it's not possible to assign week 5 (as a whole) to a single month.
Perhaps you could assign the month in which the first day of the week falls?
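A rough sketch of that idea, assuming weeks start on Monday (as ISO weeks do) and that the column is called date as in the question. The week_month column name and the three sample dates are made up for illustration.

```python
import pandas as pd

# Three sample dates around a month boundary (2016-05-01 is a Sunday)
df = pd.DataFrame({"date": pd.to_datetime(["2016-04-29", "2016-05-01", "2016-05-02"])})

# Monday of the week each date belongs to
week_start = df["date"] - pd.to_timedelta(df["date"].dt.weekday, unit="D")

# Month in which that week starts; every day of a week gets the same value
df["week_month"] = week_start.dt.month  # hypothetical column name
print(df)
```

With this rule, 1 May 2016 is still counted under April because its week began on Monday 25 April, so each week number maps to exactly one month.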

How to create a feature based on an average of X rows before? [duplicate]

I have a dataframe with years of data and many features.
For each of those features I want to create a new feature that averages the last 12 weeks of data.
So say I have weekly data. I want a datapoint for feature1B to give me the average of the last 12 rows of data from feature1A. And if the data is hourly, I want the same done but for the last 2016 rows (24 hours * 7 days * 12 weeks)
So for instance, say the data looks like this:
Week Feature1
1 8846
2 2497
3 1987
4 5294
5 2487
6 1981
7 8973
8 9873
9 8345
10 5481
11 4381
12 8463
13 7318
14 8642
15 4181
16 3871
17 7919
18 2468
19 4981
20 9871
I need the code to loop through the multiple features, create a feature name such as 'TARGET.'+feature, and output the averaged data based on my criteria (last 12 rows, last 2016 rows, depending on the format).
Week Feature1 Feature1-B
1 8846
2 2497
3 1987
4 5294
5 2487
6 1981
7 8973
8 9873
9 8345
10 5481
11 4381
12 8463
13 7318 5717.333333
14 8642 5590
15 4181 6102.083333
16 3871 6284.916667
17 7919 6166.333333
18 2468 6619
19 4981 6659.583333
20 9871 6326.916667
Appreciate any help.
Solved with the helpful comment from Chris A. Can't seem to mark that comment as an answer.
import pandas as pd
df = pd.read_csv('data.csv')
cols = df.iloc[:,2:].columns
for c in cols:
    df['12W_AVG.'+c] = df[c].rolling(2016).mean()
    df['12W_AVG.'+c] = df['12W_AVG.'+c].fillna(df['12W_AVG.'+c][2015])
    df['12W_AVG.'+c+'_LAL'] = df['12W_AVG.'+c]*0.9
    df['12W_AVG.'+c+'_UAL'] = df['12W_AVG.'+c]*1.1
    df.drop(c, axis=1, inplace=True)
Does this work for you?
import pandas as pd
import numpy as np
df = pd.DataFrame(columns=["week", "data"], data=[
[1, 8846],
[2,2497],
[3,1987],
[4,5294],
[5,2487],
[6,1981],
[7,8973],
[8,9873],
[9,8345],
[10,5481],
[11,4381],
[12,8463],
[13,7318],
[14,8642],
[15,4181],
[16,3871],
[17,7919],
[18,2468],
[19,4981],
[20,9871]])
df.insert(2, "average", 0.0, True)
for length in range(12, len(df.index)):
    # average of the 12 rows preceding the current row
    values = df.iloc[length-12:length, 1]
    weekly_sum = np.sum(values, axis=0)
    df.at[length, 'average'] = weekly_sum / 12
print(df)
Mind you, this is rough code and you will need to do some work on it yourself.
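For comparison, the same trailing average can be sketched without an explicit loop over rows by combining rolling and shift. This assumes the expected output in the question is the target (each row gets the mean of the 12 rows before it, not including itself). The TARGET. prefix comes from the question; the window variable and the list of feature columns are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Week": range(1, 21),
    "Feature1": [8846, 2497, 1987, 5294, 2487, 1981, 8973, 9873, 8345, 5481,
                 4381, 8463, 7318, 8642, 4181, 3871, 7919, 2468, 4981, 9871],
})

window = 12  # use 2016 for hourly data (24 hours * 7 days * 12 weeks)
features = ["Feature1"]  # extend with the other feature columns

for col in features:
    # rolling(window).mean() includes the current row, so shift by one
    # to average only the rows that come before it
    df["TARGET." + col] = df[col].rolling(window).mean().shift(1)

print(df)
```

The first 12 rows stay NaN because there are not yet 12 earlier rows to average.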

Comparing many timestamps with pandas

I have two dataframes of different sizes containing timestamps. For each timestamp in df B, I need to find the first timestamp in df A that comes after it. Each dataframe has around 100,000 rows, so iterating is not an option, and even df.apply() took around 6 minutes.
e.g.:
A:
11
12
15
16
18
20
25
30
50
B:
14
19
22
27
result:
15
20
25
30
Use Series.searchsorted:
out = a.loc[a['A'].searchsorted(b['B']), 'A']
print (out)
2 15
5 20
6 25
7 30
Name: A, dtype: int64
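A self-contained version of the same idea, using the numbers from the question. Two points are assumptions on my part rather than part of the answer above: searchsorted needs A to be sorted, so the sketch sorts it first, and side="right" is used so that a value in A equal to a value in B is not counted as coming "after" it.

```python
import pandas as pd

a = pd.DataFrame({"A": [11, 12, 15, 16, 18, 20, 25, 30, 50]})
b = pd.DataFrame({"B": [14, 19, 22, 27]})

# searchsorted requires sorted values; reset the index so positions line up
a_sorted = a["A"].sort_values().reset_index(drop=True)

# Position of the first element of A strictly greater than each element of B
pos = a_sorted.searchsorted(b["B"].to_numpy(), side="right")

# Drop positions past the end (values of B with nothing after them in A)
pos = pos[pos < len(a_sorted)]

out = a_sorted.iloc[pos].drop_duplicates()
print(out.tolist())
```

This runs as a binary search per element of B, so it should scale comfortably to 100,000 rows on each side.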

Interpolate from Existing Time-series Data - Python

I have a time-series of dates and interest rates on a pandas dataframe. Below is the list:
dates rates
0 3/1/2018 0.014553
1 3/8/2018 0.014951
2 4/2/2018 0.016987
3 5/1/2018 0.018719
4 6/1/2018 0.020044
5 9/4/2018 0.021602
6 12/3/2018 0.022361
7 3/1/2019 0.023080
8 6/3/2019 0.023726
9 9/3/2019 0.024333
10 12/2/2019 0.024811
11 3/2/2020 0.025234
12 3/1/2021 0.026456
13 3/1/2022 0.027126
14 3/1/2023 0.027541
15 3/1/2024 0.027898
16 3/3/2025 0.028206
17 3/2/2026 0.028486
18 3/1/2027 0.028748
19 3/1/2028 0.028998
20 3/1/2030 0.029444
21 3/1/2033 0.029850
22 3/1/2038 0.030126
23 3/2/2043 0.030019
24 3/2/2048 0.029778
I'd like to interpolate a rate for any date (for example, 03/21/2021) that falls between the min and max dates.
And I'd like to achieve this using the interpolate method of pandas. How can I do it?
I recommend numpy.interp. Here I convert the dates to numeric values first (this assumes df['dates'] has already been converted with pd.to_datetime):
np.interp(pd.to_numeric(pd.Series(pd.to_datetime('03/21/2021'))).values,pd.to_numeric(df['dates']).values,df['rates'].values)
Out[425]: array([0.02649271])
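Since the question asks specifically for the pandas interpolate method, here is a sketch of one way to do it: index the rates by date, add the target date to the index, and interpolate with method="time". Only four rows from the question's table are used, and the variable names are illustrative.

```python
import pandas as pd

# A few rows from the question's table around the target date
df = pd.DataFrame({
    "dates": ["3/2/2020", "3/1/2021", "3/1/2022", "3/1/2023"],
    "rates": [0.025234, 0.026456, 0.027126, 0.027541],
})

# Series of rates indexed by datetime
s = df.assign(dates=pd.to_datetime(df["dates"], format="%m/%d/%Y")).set_index("dates")["rates"]

target = pd.Timestamp("2021-03-21")

# Add the target date to the index (it gets NaN), then fill it in,
# weighting by the actual time gaps between dates
s_full = s.reindex(s.index.union(pd.DatetimeIndex([target])))
rate = s_full.interpolate(method="time").loc[target]
print(rate)
```

Like numpy.interp, method="time" interpolates linearly in elapsed time, so the two approaches should agree for dates inside the range.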

Group Daily Data by Week for Python Dataframe

I have a pandas dataframe that is indexed by month and then by day:
In [4]: result_GB_daily_average
Out[4]:
NREL Avert
Month Day
1 1 14.718417 37.250000
2 40.381167 45.250000
3 42.512646 40.666667
4 12.166896 31.583333
5 14.583208 50.416667
6 34.238000 45.333333
7 45.581229 29.125000
8 60.548479 27.916667
9 48.061583 34.041667
10 20.606958 37.583333
11 5.418833 70.833333
12 51.261375 43.208333
13 21.796771 42.541667
14 27.118979 41.958333
15 8.230542 43.625000
16 14.233958 48.708333
17 28.345875 51.125000
18 43.896375 55.500000
19 95.800542 44.500000
20 53.763104 39.958333
21 26.171437 50.958333
22 20.372688 66.916667
23 20.594042 42.541667
24 16.889083 48.083333
25 16.416479 42.125000
26 28.459625 40.125000
27 1.055229 49.833333
28 36.798792 42.791667
29 27.260083 47.041667
30 23.584917 55.750000
This continues on for every month of the year and I would like to be able to sort it by week instead of day, so that it looks something like this:
In [4]: result_GB_week_average
Out[4]:
NREL Avert
Month Week
1 1 Average values from first 7 days
2 Average values from next 7 days
3 Average values from next 7 days
4 Average values from next 7 days
And so forth. What would the easiest way to do this be?
I assume that by weeks you don't mean actual calendar weeks. Here is my proposed solution:
# First add a dummy column
result_GB_daily_average['count'] = 1
# Then take a cumulative count and floor-divide it by 7, so rows 1-7 fall in week 1, rows 8-14 in week 2, and so on
result_GB_daily_average['Week'] = (result_GB_daily_average['count'].cumsum() - 1) // 7 + 1
# Then do the group by and calculate the average
result_GB_week_average = result_GB_daily_average.groupby('Week')[['NREL', 'Avert']].mean()
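If the week counter should restart every month, as in the desired output with its Month and Week index, one option is to derive the week from the Day level. The sketch below uses made-up data with the question's column names and assumes "week" means consecutive blocks of 7 days within a month, so days 29 to 31 would form a short fifth week.

```python
import numpy as np
import pandas as pd

# Made-up daily data: 2 months of 28 days each, indexed by (Month, Day)
idx = pd.MultiIndex.from_product([[1, 2], range(1, 29)], names=["Month", "Day"])
daily = pd.DataFrame(
    {"NREL": np.arange(len(idx), dtype=float), "Avert": 1.0},
    index=idx,
)

# Turn the index levels into columns so Day can be used in arithmetic
flat = daily.reset_index()

# Days 1-7 -> week 1, days 8-14 -> week 2, ...
flat["Week"] = (flat["Day"] - 1) // 7 + 1

weekly = flat.groupby(["Month", "Week"])[["NREL", "Avert"]].mean()
print(weekly)
```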
