Pandas: De-seasonalizing time-series data - python

I have the following dataframe df:
[Out]:
VOL
2011-04-01 09:30:00 11297
2011-04-01 09:30:10 6526
2011-04-01 09:30:20 14021
2011-04-01 09:30:30 19472
2011-04-01 09:30:40 7602
...
2011-04-29 15:59:30 79855
2011-04-29 15:59:40 83050
2011-04-29 15:59:50 602014
This df consists of volume observations taken every 10 seconds over 22 non-consecutive days. I want to de-seasonalize the time series by dividing each observation by the average volume of its 5-minute time-of-day interval. To do so, I need the average volume for every 5-minute interval across the 22 days, i.e. a series of averages for 9:30:00 - 9:35:00, 9:35:00 - 9:40:00, 9:40:00 - 9:45:00, ... up to 16:00:00. The average for the 9:30:00 - 9:35:00 interval is the average volume in that interval across all 22 days (i.e. the total volume between 9:30:00 and 9:35:00 on day 1 + day 2 + ... + day 22, divided by 22). I would then divide each observation in df that falls between 9:30:00 and 9:35:00 by the average of that interval.
Is there a package in Python / pandas that can do this?

Edited answer:
import datetime
import numpy as np
import pandas as pd

date_times = pd.date_range(datetime.datetime(2011, 4, 1, 9, 30),
                           datetime.datetime(2011, 4, 16, 0, 0),
                           freq='10s')
VOL = np.random.sample(date_times.size) * 10000.0
df = pd.DataFrame(data={'VOL': VOL, 'time': date_times}, index=date_times)
df['h'] = df.index.hour
df['m'] = df.index.minute
# 5-minute means per day (resample(..., how=...) was removed in later pandas versions)
df1 = df.resample('5Min').agg({'VOL': 'mean'})
# average each 5-minute slot across all days
times = pd.to_datetime(df1.index)
df2 = df1.groupby([times.hour, times.minute]).VOL.mean().reset_index()
df2.columns = ['h', 'm', 'VOL']
# divide every observation by the average of its 5-minute slot
df_norm = df.merge(df2, on=['h', 'm'])
df_norm['norm'] = df_norm['VOL_x'] / df_norm['VOL_y']
Older answer (keeping it temporarily)
Use the resample function (the how= keyword shown here only works on older pandas; newer versions use df.resample('5Min').mean()):
df.resample('5Min', how={'VOL': np.mean})
e.g.:
date_times = pd.date_range(datetime.datetime(2011, 4, 1, 9, 30),
                           datetime.datetime(2011, 4, 16, 0, 0),
                           freq='10s')
VOL = np.random.sample(date_times.size) * 10000.0
df = pd.DataFrame(data={'VOL': VOL}, index=date_times)
df.resample('5Min', how={'VOL': np.mean})
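For recent pandas, a more compact sketch of the same normalization, continuing from the df built above: bucket each timestamp into its 5-minute time-of-day slot and divide by the slot average directly. This weights every 10-second observation equally, which matches the average-of-averages above when all days have complete data.
# time-of-day of each observation's 5-minute bucket (09:30:00, 09:35:00, ...)
bucket = df.index.floor('5min').time
# average volume of that bucket across all days, broadcast back to every row
slot_mean = df['VOL'].groupby(bucket).transform('mean')
df['VOL_deseasonalized'] = df['VOL'] / slot_mean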


Is there any function to calculate the duration in minutes between two datetime values?

This is my dataframe.
Start_hour End_date
23:58:00 00:26:00
23:56:00 00:01:00
23:18:00 23:36:00
How can I get in a new column the difference (in minutes) between these two columns?
>>> from datetime import datetime
>>>
>>> before = datetime.now()
>>> print('wait for more than 1 minute')
wait for more than 1 minute
>>> after = datetime.now()
>>> td = after - before
>>>
>>> td
datetime.timedelta(seconds=98, microseconds=389121)
>>> td.total_seconds()
98.389121
>>> td.total_seconds() / 60
1.6398186833333335
Then you can round it or use it as-is.
You can do something like this:
import pandas as pd

df = pd.DataFrame({
    'Start_hour': ['23:58:00', '23:56:00', '23:18:00'],
    'End_date': ['00:26:00', '00:01:00', '23:36:00']}
)
df['Start_hour'] = pd.to_datetime(df['Start_hour'])
df['End_date'] = pd.to_datetime(df['End_date'])
df['diff'] = df.apply(
    lambda row: (row['End_date'] - row['Start_hour']).seconds / 60,
    axis=1
)
print(df)
Start_hour End_date diff
0 2021-03-29 23:58:00 2021-03-29 00:26:00 28.0
1 2021-03-29 23:56:00 2021-03-29 00:01:00 5.0
2 2021-03-29 23:18:00 2021-03-29 23:36:00 18.0
You can also rearrange your dates as string again if you like:
df['Start_hour'] = df['Start_hour'].apply(lambda x: x.strftime('%H:%M:%S'))
df['End_date'] = df['End_date'].apply(lambda x: x.strftime('%H:%M:%S'))
print(df)
Output:
Start_hour End_date diff
0 23:58:00 00:26:00 28.0
1 23:56:00 00:01:00 5.0
2 23:18:00 23:36:00 18.0
Short answer:
df['interval'] = df['End_date'] - df['Start_hour']
df.loc[df['End_date'] < df['Start_hour'], 'interval'] += timedelta(hours=24)
Why so:
You are probably trying to solve the problem that your Start_hour and End_date values sometimes belong to different days, which is why you can't just subtract one from the other.
If your time window never exceeds a 24-hour interval, you can use some modular arithmetic to deal with the 23:59:59 - 00:00:00 boundary:
if End_date < Start_hour, this always means End_date belongs to the next day
this implies that if End_date - Start_hour < 0, we should add 24 hours to End_date to find the actual difference
The final formula is:
if rec['Start_hour'] < rec['End_date']:
    offset = timedelta(hours=0)
else:
    offset = timedelta(hours=24)
rec['delta'] = offset + rec['End_date'] - rec['Start_hour']
To do the same with pandas.DataFrame we need to change code accordingly. And
that's how we get the snippet from the beginning of the answer.
import pandas as pd
from datetime import datetime, timedelta

df = pd.DataFrame([
    {'Start_hour': datetime(1, 1, 1, 23, 58, 0), 'End_date': datetime(1, 1, 1, 0, 26, 0)},
    {'Start_hour': datetime(1, 1, 1, 23, 58, 0), 'End_date': datetime(1, 1, 1, 23, 59, 0)},
])
# ...
df['interval'] = df['End_date'] - df['Start_hour']
df.loc[df['End_date'] < df['Start_hour'], 'interval'] += timedelta(hours=24)
> df
Start_hour End_date interval
0 0001-01-01 23:58:00 0001-01-01 00:26:00 0 days 00:28:00
1 0001-01-01 23:58:00 0001-01-01 23:59:00 0 days 00:01:00
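If you need the result expressed in minutes rather than as a timedelta, a small follow-up sketch using the interval column computed above:
df['minutes'] = df['interval'].dt.total_seconds() / 60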

Subsetting out rows using pandas

I have two sets of dataframes: datamax, datamax2015 and datamin, datamin2015.
Snippet of data:
print(datamax.head())
print(datamin.head())
print(datamax2015.head())
print(datamin2015.head())
Date ID Element Data_Value
0 2005-01-01 USW00094889 TMAX 156
1 2005-01-02 USW00094889 TMAX 139
2 2005-01-03 USW00094889 TMAX 133
3 2005-01-04 USW00094889 TMAX 39
4 2005-01-05 USW00094889 TMAX 33
Date ID Element Data_Value
0 2005-01-01 USC00200032 TMIN -56
1 2005-01-02 USC00200032 TMIN -56
2 2005-01-03 USC00200032 TMIN 0
3 2005-01-04 USC00200032 TMIN -39
4 2005-01-05 USC00200032 TMIN -94
Date ID Element Data_Value
0 2015-01-01 USW00094889 TMAX 11
1 2015-01-02 USW00094889 TMAX 39
2 2015-01-03 USW00014853 TMAX 39
3 2015-01-04 USW00094889 TMAX 44
4 2015-01-05 USW00094889 TMAX 28
Date ID Element Data_Value
0 2015-01-01 USC00200032 TMIN -133
1 2015-01-02 USC00200032 TMIN -122
2 2015-01-03 USC00200032 TMIN -67
3 2015-01-04 USC00200032 TMIN -88
4 2015-01-05 USC00200032 TMIN -155
For datamax, datamax2015, I want to compare their Data_Value columns and create a dataframe of entries in datamax2015 whose Data_Value is greater than all entries in datamax for the same day of the year. Thus, the expected output should be a dataframe with rows from 2015-01-01 to 2015-12-31 but with dates only where the values in the Data_Value column are greater than those in the Data_Value column of the datamax dataframe.
i.e. 4 columns and anything from 1 to 364 rows, depending on the condition above.
I want the converse (min) for the datamin and datamin2015 dataframes.
I have tried the following code:
upper = []
for row in datamax.iterrows():
    for j in datamax2015["Data_Value"]:
        if j > row["Data_Value"]:
            upper.append(row)
lower = []
for row in datamin.iterrows():
    for j in datamin2015["Data_Value"]:
        if j < row["Data_Value"]:
            lower.append(row)
Could anyone give me a helping hand as to where I am going wrong?
This code does what you want for the datamin. Try to adapt it to the datamax symmetric case as well (a sketch is included after the explanation below) - leave a comment if you have trouble and I'm happy to help further.
Create Data
from datetime import datetime
import pandas as pd
datamin = pd.DataFrame({"date": pd.date_range(start=datetime(2005, 1, 1), end=datetime(2015, 12, 31)), "Data_Value": 1})
datamin["day_of_year"] = datamin["date"].dt.dayofyear
# Set the value for the 4th day of the year higher in order for the desired result to be non-empty
datamin.loc[datamin["day_of_year"]==4, "Data_Value"] = 2
datamin2015 = pd.DataFrame({"date": pd.date_range(start=datetime(2015, 1, 1), end=datetime(2015, 12, 31)), "Data_Value": 2})
datamin2015["day_of_year"] = datamin["date"].dt.dayofyear
# Set the value for the 4th day of the year lower in order for the desired result to be non-empty
datamin2015.loc[3, "Data_Value"] = 1
The solution
df1 = datamin.groupby("day_of_year").agg({"Data_Value": "min"})
df2 = datamin2015.join(df1, on="day_of_year", how="left", lsuffix="2015")
lower = df2.loc[df2["Data_Value2015"]<df2["Data_Value"]]
lower
We group the datamin by day of year to find the min across all the years for each day of the year (using .dt.dayofyear). Then we join that with datamin2015 and finally can then compare the Data_Value2015 with Data_Value in order to find the indexes of the rows where the Data_Value in 2015 was less than the minimum across all same days of the year in datamin.
In the example above, lower has 1 row because of the way I set up the dataframes.
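For completeness, a hedged sketch of the symmetric datamax case - the same pattern with max and the comparison flipped, assuming datamax and datamax2015 carry a day_of_year column analogous to the example above:
# max across all years for each day of the year
df1_max = datamax.groupby("day_of_year").agg({"Data_Value": "max"})
df2_max = datamax2015.join(df1_max, on="day_of_year", how="left", lsuffix="2015")
# 2015 rows that exceed the all-years maximum for the same day of year
upper = df2_max.loc[df2_max["Data_Value2015"] > df2_max["Data_Value"]]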
Python code which returns a line graph of the record high and record low temperatures by day of the year over the period 2005-2014. The area between the record high and record low temperatures for each day should be shaded.
Overlay a scatter of the 2015 data for any points (highs and lows) for which the ten-year (2005-2014) record high or record low was broken in 2015.
Remove leap year dates (i.e. 29th February).
from datetime import datetime
import pandas as pd
import matplotlib.pyplot as plt
pd.set_option("display.max_rows",None,"display.max_columns",None)
data = pd.read_csv('data/C2A2_data/BinnedCsvs_d400/fb441e62df2d58994928907a91895ec62c2c42e6cd075c2700843b89.csv')
newdata = data[(data['Date'] >= '2005-01-01') & (data['Date'] <= '2014-12-12')]
datamax = newdata[newdata['Element']=='TMAX'].copy()  # .copy() so the later column assignments don't raise SettingWithCopyWarning
datamin = newdata[newdata['Element']=='TMIN'].copy()
datamax['Date'] = pd.to_datetime(datamax['Date'])
datamin['Date'] = pd.to_datetime(datamin['Date'])
datamax["day_of_year"] = datamax["Date"].dt.dayofyear
datamax = datamax.groupby('day_of_year').max()
datamin["day_of_year"] = datamin["Date"].dt.dayofyear
datamin = datamin.groupby('day_of_year').min()
datamax = datamax.reset_index()
datamin = datamin.reset_index()
datamin['Date'] = datamin['Date'].dt.strftime('%Y-%m-%d')
datamax['Date'] = datamax['Date'].dt.strftime('%Y-%m-%d')
datamax = datamax[~datamax['Date'].str.contains("02-29")]
datamin = datamin[~datamin['Date'].str.contains("02-29")]
breakoutdata = data[(data['Date'] > '2014-12-31')]
datamax2015 = breakoutdata[breakoutdata['Element']=='TMAX'].copy()
datamin2015 = breakoutdata[breakoutdata['Element']=='TMIN'].copy()
datamax2015['Date'] = pd.to_datetime(datamax2015['Date'])
datamin2015['Date'] = pd.to_datetime(datamin2015['Date'])
datamax2015["day_of_year"] = datamax2015["Date"].dt.dayofyear
datamax2015 = datamax2015.groupby('day_of_year').max()
datamin2015["day_of_year"] = datamin2015["Date"].dt.dayofyear
datamin2015 = datamin2015.groupby('day_of_year').min()
datamax2015 = datamax2015.reset_index()
datamin2015 = datamin2015.reset_index()
datamin2015['Date'] = datamin2015['Date'].dt.strftime('%Y-%m-%d')
datamax2015['Date'] = datamax2015['Date'].dt.strftime('%Y-%m-%d')
datamax2015 = datamax2015[~datamax2015['Date'].str.contains("02-29")]
datamin2015 = datamin2015[~datamin2015['Date'].str.contains("02-29")]
dataminappend = datamin2015.join(datamin,on="day_of_year",rsuffix="_new")
lower = dataminappend.loc[dataminappend["Data_Value_new"]>dataminappend["Data_Value"]].copy()
datamaxappend = datamax2015.join(datamax,on="day_of_year",rsuffix="_new")
upper = datamaxappend.loc[datamaxappend["Data_Value_new"]<datamaxappend["Data_Value"]].copy()
upper['Date'] = pd.to_datetime(upper['Date'])
lower['Date'] = pd.to_datetime(lower['Date'])
datamax['Date'] = pd.to_datetime(datamax['Date'])
datamin['Date'] = pd.to_datetime(datamin['Date'])
ax = plt.gca()
plt.plot(datamax['day_of_year'],datamax['Data_Value'],color='red')
plt.plot(datamin['day_of_year'],datamin['Data_Value'], color='blue')
plt.scatter(upper['day_of_year'],upper['Data_Value'],color='purple')
plt.scatter(lower['day_of_year'],lower['Data_Value'], color='cyan')
plt.ylabel("Temperature (degrees C)",color='navy')
plt.xlabel("Date",color='navy',labelpad=15)
plt.title('Record high and low temperatures by day (2005-2014)', alpha=1.0,color='brown',y=1.08)
ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.35),fancybox=False,labels=['Record high','Record low'])
plt.xticks(rotation=30)
plt.fill_between(range(len(datamax['Date'])), datamax['Data_Value'], datamin['Data_Value'],color='yellow',alpha=0.8)
plt.show()
I converted the 'Date' column to a string using datamin['Date'] = datamin['Date'].dt.strftime('%Y-%m-%d').
I then converted it back to datetime format using upper['Date'] = pd.to_datetime(upper['Date']).
I then used 'day_of_year' as the x-value.

Cumulative sum over days in python

I have the following dataframe:
date money
0 2018-01-01 20
1 2018-01-05 30
2 2018-02-15 7
3 2019-03-17 150
4 2018-01-05 15
...
2530 2019-03-17 350
And I need:
[(2018-01-01,20),(2018-01-05,65),(2018-02-15,72),...,(2019-03-17,572)]
So I need to do a cumulative sum of money over all days.
So far I have tried many things, and the closest I think I've got is:
graph_df.date = pd.to_datetime(graph_df.date)
temporary = graph_df.groupby('date').money.sum()
temporary = temporary.groupby(temporary.index.to_period('date')).cumsum().reset_index()
But this gives me ValueError: Invalid frequency: date
Could anyone help please?
Thanks
I don't think you need the second groupby. You can simply add a column with the cumulative sum.
This does the trick for me:
import pandas as pd
df = pd.DataFrame({'date': ['01-01-2019','04-06-2019', '07-06-2019'], 'money': [12,15,19]})
df['date'] = pd.to_datetime(df['date']) # this is not strictly needed
tmp = df.groupby('date')['money'].sum().reset_index()
tmp['money_sum'] = tmp['money'].cumsum()
Converting the date column to an actual date is not needed for this to work.
list(map(tuple, df.groupby('date', as_index=False)['money'].sum().values))
Edit:
df = pd.DataFrame({'date': ['2018-01-01', '2018-01-05', '2018-02-15', '2019-03-17', '2018-01-05'],
                   'money': [20, 30, 7, 150, 15]})
#df['date'] = pd.to_datetime(df['date'])
#df = df.sort_values(by='date')
temporary = df.groupby('date', as_index=False)['money'].sum()
temporary['money_cum'] = temporary['money'].cumsum()
Result:
>>> list(map(tuple, temporary[['date', 'money_cum']].values))
[('2018-01-01', 20),
('2018-01-05', 65),
('2018-02-15', 72),
('2019-03-17', 222)]
You can try using df.groupby('date').sum():
example data frame:
df
date money
0 01/01/2018 20
1 05/01/2018 30
2 15/02/2018 7
3 17/03/2019 150
4 05/01/2018 15
5 17/03/2019 550
6 15/02/2018 13
df['cumsum'] = df.money.cumsum()
list(zip(df.groupby('date').tail(1)['date'], df.groupby('date').tail(1)['cumsum']))
[('01/01/2018', 20),
('05/01/2018', 222),
('17/03/2019', 772),
('15/02/2018', 785)]
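Note that cumsum follows the row order of the frame, so if the rows are not already in chronological order you may want to parse and sort the dates first. A small sketch assuming the same df as above (the dates there are dd/mm/yyyy strings):
df['date'] = pd.to_datetime(df['date'], dayfirst=True)  # parse dd/mm/yyyy
df = df.sort_values('date')
df['cumsum'] = df['money'].cumsum()
list(zip(df.groupby('date').tail(1)['date'], df.groupby('date').tail(1)['cumsum']))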

How to add a new column by searching for data in a Pandas time series dataframe

I have a Pandas time series dataframe.
It has minute data for a stock for 30 days.
I want to create a new column stating the price of the stock at noon for that day, e.g. for all lines for January 1, I want a new column with the price at noon on January 1, and for all lines for January 2, I want a new column with the price at noon on January 2, etc.
Existing timeframe:
Date     Time   Last_Price  12amT
1/1/19   08:00  100         ?
1/1/19   08:01  101         ?
1/1/19   08:02  100.50      ?
...
31/1/19  21:00  106         ?
I used this hack, but it is very slow, and I assume there is a quicker and easier way to do this.
for lab, row in df.iterrows():
    t = row["Date"]
    df.loc[lab, "12amT"] = df[(df['Date'] == t) & (df['Time'] == "12:00")]["Last_Price"].values[0]
One way to do this is to use groupby with pd.Grouper:
For pandas 0.24.1+:
df.groupby(pd.Grouper(freq='D'))[0]\
  .transform(lambda x: x.loc[(x.index.hour == 12) &
                             (x.index.minute == 0)].to_numpy()[0])
Older pandas use:
df.groupby(pd.Grouper(freq='D'))[0]\
  .transform(lambda x: x.loc[(x.index.hour == 12) &
                             (x.index.minute == 0)].values[0])
MVCE:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(48*60), index=pd.date_range('02-01-2019', periods=(48*60), freq='T'))
df['12amT'] = df.groupby(pd.Grouper(freq='D'))[0].transform(lambda x: x.loc[(x.index.hour == 12) & (x.index.minute == 0)].to_numpy()[0])
Output (head):
0 12amT
2019-02-01 00:00:00 0 720
2019-02-01 00:01:00 1 720
2019-02-01 00:02:00 2 720
2019-02-01 00:03:00 3 720
2019-02-01 00:04:00 4 720
I'm not sure why you have two DateTime columns, so I made my own example to demonstrate:
import numpy as np
import pandas as pd

ind = pd.date_range('1/1/2019', '30/1/2019', freq='H')
df = pd.DataFrame({'Last_Price': np.random.random(len(ind)) + 100}, index=ind)

def noon_price(df):
    noon_price = df.loc[df.index.hour == 12, 'Last_Price'].values
    noon_price = noon_price[0] if len(noon_price) > 0 else np.nan
    df['noon_price'] = noon_price
    return df

df.groupby(df.index.day).apply(noon_price).reindex(ind)
The groupby/apply fills each day's rows with that day's noon price, and reindex(ind) puts the result back on the original hourly index.
To add a column with the next day's noon price, you can shift the column up by 24 rows (one day at hourly frequency), like this:
df['T+1'] = df.noon_price.shift(-24)
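Another hedged sketch, assuming the same Last_Price frame with a DatetimeIndex as above: the per-day noon lookup can also be done without a Python-level lambda by blanking everything except the noon rows and letting groupby broadcast that value back over each day:
is_noon = (df.index.hour == 12) & (df.index.minute == 0)
# keep only the noon value per day, then spread it to every row of that day
df['12amT'] = df['Last_Price'].where(is_noon).groupby(df.index.normalize()).transform('first')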

Pandas: monthly date range with number of days

Suppose I have a start and end dates like so:
start_d = datetime.date(2017, 7, 20)
end_d = datetime.date(2017, 9, 10)
I wish to obtain a Pandas DataFrame that looks like this:
Month NumDays
2017-07 12
2017-08 31
2017-09 10
It shows the number of days in each month that is contained in my range.
So far I can generate the monthly series with pd.date_range(start_d, end_d, freq='MS').
You can use date_range with the default daily frequency first, then create a Series and resample with size. Finally, convert to a monthly period with to_period:
import datetime as dt
import pandas as pd

start_d = dt.date(2017, 7, 20)
end_d = dt.date(2017, 9, 10)
s = pd.Series(index=pd.date_range(start_d, end_d), dtype='float64')
df = s.resample('MS').size().rename_axis('Month').reset_index(name='NumDays')
df['Month'] = df['Month'].dt.to_period('m')
print (df)
Month NumDays
0 2017-07 12
1 2017-08 31
2 2017-09 10
Thanks to Zero for the simplified solution:
df = s.resample('MS').size().to_period('m').rename_axis('Month').reset_index(name='NumDays')
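An equivalent sketch built from value_counts, assuming the same start_d and end_d, which yields the same Month/NumDays frame:
df = (pd.Series(pd.date_range(start_d, end_d))
        .dt.to_period('M')      # month period of each day in the range
        .value_counts()         # number of days falling in each month
        .sort_index()
        .rename_axis('Month')
        .reset_index(name='NumDays'))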
