I have a multivariate time series array. The series is currently aggregated in 10-second intervals:
**Time**
2016-01-11 17:00:00
2016-01-11 17:00:10
2016-01-11 17:00:20
I want to resample so that I get one 5-hour timeframe per day (it doesn't matter how the time is shown in the data frame, it just matters that it's being aggregated properly). I am resampling by the mean values.
**Time**
2016-01-11 10:00:00-15:00:00
2016-01-12 10:00:00-15:00:00
2016-01-13 10:00:00-15:00:00
How would one do this?
First I would filter the time period I want and group by day:
# mask the hours we want
hours = df.index.hour
mask = (hours >= 10) & (hours <= 14)
# group by calendar day and take the mean
df[mask].groupby(df[mask].index.floor('D')).mean()
Toy data:
Times = pd.date_range('2016-01-11', '2016-01-14', freq='10s')
np.random.seed(1)
df = pd.DataFrame({'Value': np.random.randint(1, 10, len(Times))},
                  index=pd.Index(Times, name='Time'))
gives:
Value
Time
2016-01-11 4.993333
2016-01-12 5.030556
2016-01-13 5.012778
df.groupby([df['Time'].dt.month, df['Time'].dt.day]).apply(
    lambda x: x.set_index('Time').resample('5H').mean())
You will have to group by the month and day of your Time column first, then resample the Time column into 5H (5-hour) bins, followed by .mean(), which takes the mean of your other columns.
The reason for the groupby is that you don't want 5-hour intervals across every full day, only within the span of times each day covers. As long as each day's times fall within 5 hours, you will get only one interval per day.
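A minimal runnable sketch of this approach (toy data and column names assumed, not from the original question):

```python
import numpy as np
import pandas as pd

# Toy data: 10-second readings that only occur between 10:00 and 15:00 each day
times = pd.date_range('2016-01-11 10:00', '2016-01-13 14:59:50', freq='10s')
times = times[(times.hour >= 10) & (times.hour < 15)]
np.random.seed(1)
df = pd.DataFrame({'Time': times,
                   'Value': np.random.randint(1, 10, len(times))})

# Group by month and day, then collapse each day's rows into one 5-hour bin
out = df.groupby([df['Time'].dt.month, df['Time'].dt.day]).apply(
    lambda x: x.set_index('Time').resample('5h').mean())
print(out)
```

Because each day's readings span less than 5 hours, each day collapses to a single row.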
Related
At the moment I am working on a time series project.
I have daily data points over a 5-year timespan. In between there are some days with 0 values, and some days are missing.
For example:
2015-01-10 343
2015-03-10 128
Day 2 of October is missing.
In order to build a good time series model I want to resample the data to monthly frequency:
df.individuals.resample("M").sum()
but I am getting the following output:
2015-01-31 343.000000
2015-02-28 NaN
2015-03-31 64.500000
Somehow the months are completely wrong.
The expected output would look like this:
2015-31-10 Sum of all days
2015-30-11 Sum of all days
2015-31-12 Sum of all days
Pandas is interpreting your date as %Y-%m-%d.
You should explicitly specify your date format before doing the resample.
Try this:
df.index = pd.to_datetime(df.index, format="%Y-%d-%m")
>>> df.resample("M").sum()
2015-10-31 471
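A minimal end-to-end sketch (toy values based on the sample above; the index strings are assumed to be in %Y-%d-%m order):

```python
import pandas as pd

# '2015-01-10' here means 1 October 2015, '2015-03-10' means 3 October 2015
df = pd.DataFrame({'individuals': [343, 128]},
                  index=['2015-01-10', '2015-03-10'])

# Parse explicitly so pandas does not misread the day-first dates as %Y-%m-%d
df.index = pd.to_datetime(df.index, format="%Y-%d-%m")

monthly = df['individuals'].resample('M').sum()
print(monthly)  # 2015-10-31    471
```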
I've a pandas dataframe to which I applied pandas.to_datetime. Now I want to extract the hours/minutes/seconds from each timestamp. I used df.index.day to get the days, and now I want to know if there are different hours in my index.
For example, if I have two dates d1 = 2020-01-01 00:00:00 and d2 = 2020-01-02 00:00:00, I can't assume I should apply a smoothing operator by hour, because it makes no sense.
So what I want to know is: how do I know if a day has different hours/minutes or seconds?
Thank you in advance
I think you should use the .dt accessor that pandas provides on datetime columns (df[col].dt).
You can extract the day, week, hour, minute and second with it.
To see everything it offers:
dir(df[col].dt)
Here is an example.
import pandas as pd
df = pd.DataFrame([["2020-01-01 06:31:00"], ["2020-03-12 10:21:09"]], columns=["timestamp"])
print(df)
df['time'] = pd.to_datetime(df["timestamp"])
df['dates'] = df['time'].dt.date
df['hour'] = df['time'].dt.hour
df['minute'] = df['time'].dt.minute
df['second'] = df['time'].dt.second
Now your df should look like this:
             timestamp                time       dates  hour  minute  second
0  2020-01-01 06:31:00 2020-01-01 06:31:00  2020-01-01     6      31       0
1  2020-03-12 10:21:09 2020-03-12 10:21:09  2020-03-12    10      21       9
If d1 and d2 are datetime or Timestamp objects, you can get the hour, minute and second using the attributes hour, minute and second:
print(d1.hour,d1.minute,d1.second)
print(d2.hour,d2.minute,d2.second)
Similarly, year, month and day can also be extracted.
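A quick self-contained illustration (made-up timestamp):

```python
import pandas as pd

d1 = pd.Timestamp('2020-03-12 10:21:09')
print(d1.hour, d1.minute, d1.second)  # 10 21 9
print(d1.year, d1.month, d1.day)      # 2020 3 12
```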
I have the following dataframe which is at a day level:
BillDate S2Rate
4 2019-06-04 4686.5
3 2019-06-03 1557.5
2 2019-05-21 10073.5
1 2019-05-19 6501.5
0 2019-05-18 1378.0
I want to calculate the WoW percentage change and the WoW increase or decrease using this data. How do I do this?
Also, how do I replicate this for YoY and day-on-day?
You should use resample. Then you can use functions like pct_change and diff to get the differences:
# df["BillDate"] = pd.to_datetime(df["BillDate"])
week_over_week = df.set_index("BillDate").resample("W").sum()
week_over_week_pct = week_over_week.pct_change()
week_over_week_increase = week_over_week.diff()
You can replace the parameter for resample with "D" for day over day, "Y" for year over year and many other options for more complex time ranges.
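Put together with the sample data from the question, a sketch might look like:

```python
import pandas as pd

df = pd.DataFrame({
    'BillDate': ['2019-06-04', '2019-06-03', '2019-05-21',
                 '2019-05-19', '2019-05-18'],
    'S2Rate': [4686.5, 1557.5, 10073.5, 6501.5, 1378.0],
})
df['BillDate'] = pd.to_datetime(df['BillDate'])

# Weekly totals, labelled by the Sunday that ends each week
weekly = df.set_index('BillDate').resample('W')['S2Rate'].sum()
print(weekly.pct_change())  # week-over-week % change
print(weekly.diff())        # week-over-week absolute change
```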
Set BillDate as index after coercing it to a datetime
df.set_index(pd.to_datetime(df['BillDate']), inplace=True)
df
Get rid of BillDate from columns now that you moved it to index
df.drop(columns=['BillDate'], inplace=True)
Resample to required period, calculate sum and percentage change
df.resample('W')['S2Rate'].sum().pct_change().to_frame()
Please note that resample labels each bin by the end of its period:
'W' - labels each week with the Sunday that ends it
'M' - labels each month with its last date
I have trading logs and would like to resample my data as follows:
Resample to a 2-hour timeframe with OHLC (which I am able to achieve)
The result should fall on "odd" timeframes, not "even" ones (which I'm struggling with right now)
e.g
9:00 ....
11:00 ....
13:00 ....
I tried to resample my log using the following code, but it ends up with an "even" timeframe.
min_1 = df.resample('2H').ohlc()
Result:
2019-12-12 04:00:00+00:00 7144.0 7165.0 7131.0 7132.5 56757860.0
2019-12-12 06:00:00+00:00 7132.5 7158.5 7132.5 7158.0 44329860.0
2019-12-12 08:00:00+00:00 7158.0 7158.5 7096.5 7121.5 104173650.0
2019-12-12 10:00:00+00:00 7121.5 7223.0 7121.5 7148.5 174419981.0
2019-12-12 12:00:00+00:00 7148.5 7193.5 7148.5 7169.0 65978310.0
Is there a way to resample to an "odd" timeframe?
(The reason I want to achieve this is that TradingView's 2-hour timeframe is based on odd hours, so I want to adjust my code to match.)
Sorry, resolved by myself.
min_1 = df.resample(rule='2H',base=1).ohlc()
(adding "base=1")
Thank you.
You should pass base=1 as a parameter. Here is my solution, have a look:
import pandas as pd
import numpy as np
rng = pd.date_range('2015-01-01', '2015-12-31', freq='15min')
df = pd.DataFrame(index=rng)
# Average speed in miles per hour
df['speed'] = np.random.randint(low=0, high=60, size=len(df.index))
# Distance in miles (speed * 0.25 hours per 15-minute interval)
df['distance'] = df['speed'] * 0.25
df.resample('2h', base=1).sum()
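Note that in pandas 1.1+ the base argument is deprecated in favour of offset; a sketch of the equivalent call (toy data assumed):

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2019-12-12 04:00', periods=24, freq='30min')
df = pd.DataFrame({'price': np.arange(24.0)}, index=idx)

# offset shifts every bin edge; '1h' moves 04:00, 06:00, ... to 05:00, 07:00, ...
odd = df.resample('2h', offset='1h').ohlc()
print(odd.index)  # bins now start on odd hours
```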
I have a dateframe that contains datetime and price.
Here is a sample chosen at random
In [2]: df
Out[2]:
price datetime
239035 5.05 2016-04-14 14:13:27
1771224 5.67 2016-08-30 14:19:47
2859140 4.00 2016-12-05 20:57:01
1311384 7.71 2016-07-08 18:16:22
141709 4.19 2016-04-07 13:30:00
2802527 3.94 2016-11-30 15:36:11
1411955 7.27 2016-07-20 13:55:20
2215987 4.87 2016-10-07 19:56:13
The datetime is accurate to the second.
I want to calculate the average price every 15 minutes starting at 9:00am and ending at 4:30pm, and store the new data into a new dataframe.
I could do it the old-fashioned way: make a list of all the 15-minute intervals within 9am-4:30pm for each date, iterate through each row of the CSV file, check its time and drop it into the appropriate bucket, then find the average value for each bucket in each day.
But I was wondering if there is a nicer way to do this in pandas. If not, I'll just brute-force my way through it...
You can use DataFrame.resample:
df2 = df.resample(rule='15min', on='datetime').mean()
Passing on='datetime' makes the resampled result indexed by the datetime bins, so df2 already has a DatetimeIndex. Then you can filter out the times you don't want with between_time:
df2.between_time('9:00', '16:30')
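A full sketch with made-up prices (random seconds-level data over one day):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
times = pd.date_range('2016-04-14 08:00', '2016-04-14 17:00', freq='37s')
df = pd.DataFrame({'datetime': times,
                   'price': np.random.uniform(3, 8, len(times))})

# on='datetime' leaves the result indexed by the 15-minute bins,
# so between_time can be applied directly
df2 = df.resample('15min', on='datetime').mean()
out = df2.between_time('9:00', '16:30')
print(out.head())
```

between_time is inclusive of both endpoints by default, so the 09:00 and 16:30 bins are both kept.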